PREFACE

GREGOR MENDEL did not use probability
methods in the analysis of his classical
experiments, but after the rediscovery of his
work, at the beginning of this century, the
application of such a technique was soon recog- 
nized as a necessary part of genetical analysis. 
The problems at first were chiefly involved with 
the testing of the significance of departures from 
expected ratios, but, with the discovery of link- 
age, questions of estimation became equally 
important. The absence of a suitable statistical 
technique is one of the prime causes of the failure 
to understand linkage and linkage groups during 
the years before the Drosophila technique of 
backcrossing was extensively practised, and the 
true relations of coupling and repulsion realized. 
Since that time statistical methods have become 
more and more favoured by genetical workers 
and are to-day more than ever necessary for 
genetical analysis. 

Now statistics, like genetics, is a growing 
Science and the crude and often arbitrary methods 
of yesterday have been superseded and rendered 
redundant by the development of more exact 
and more adaptable techniques. This is largely
a result of the work of R. A. Fisher and his
associates. In some branches of biology, as for
example agronomical experimentation, such refined
statistical methods are now in common use
for several reasons, not the least important of 
which is the existence of articles and text- 
books giving details of their application. In 
genetics these methods have been less fully 
described and applied. It is hoped that the 
present work will serve to bring to the notice of 
geneticists the desirability of employing such 
methods and provide the necessary instruction 
for their use. 

It is not claimed that this book is complete. 
In fact some statistical devices such as the 
analysis of variance and the use of regressions 
have been entirely omitted as they are, at pre- 
sent, of secondary importance to the geneticist. 
On the other hand, the uses of $\chi^2$ and of maximum
likelihood have been dealt with in considerable 
detail as they are of wide application in this 
branch of science. The methods are described 
in general and some specific applications are 
discussed in detail. No attempt has been made 
to prove all the general formulae used, though 
some have been considered in detail. Such a 
course is beyond the scope of this work and 
would, in any case, merely result in confusion 
for a non-mathematical reader. If desired, proofs 
may be found in the original literature cited. 
But numerical examples have been used wherever 
possible in the hope that the reader, by working 
through them, will familiarize himself with the 
methods and be able to apply them to other 
problems. The list of general formulae in the last 
chapter should make reference to any method, 
and its use for other analyses, a simple matter. 
It cannot be overemphasized that, in order to
make full use of the book, the actual examples
used should be worked through in detail, as it is 
only by this means that they will be fully under- 
stood and appreciated. 

For further details of the genetical and cytological
principles, which have been assumed here
in order to permit fuller development of the
statistics, the reader is referred to Mendelism
and Evolution, by E. B. Ford, and to The Chromosomes,
by M. J. D. White, both of which are in
this series of Biological Monographs. 

I am indebted to Professor R. A. Fisher, Head of 
the Galton Laboratory, for permitting the repro- 
duction herein of a number of tables mainly from 
the Annals of Eugenics, and for continued encour- 
agement during the writing of this book, and to 
Messrs. W. J. C. Lawrence and J. C. Cullen, of 
the John Innes Horticultural Institution, for 
reading and criticizing the manuscript. I also 
wish to thank Messrs. Oliver and Boyd for allow- 
ing me to reproduce Tables I and II, from R. A. 
Fisher’s Statistical Methods for Research Workers, 

CHAPTER I


INTRODUCTORY 
1. GENETICAL 

THE science of Genetics is based on the experiments
of Gregor Mendel. His hypothesis of
particulate inheritance is the foundation of modern 
genetical theory and his experimental technique of 
observing the occurrence and extent of segregations 
for single factor differences is the basis of modern 
genetical method. 

The field of research covered by Genetics to-day 
is very large and very varied. The studies of 
chromosome behaviour, heterozygosity of wild populations
and the physiology of gene action, to name
but three examples, are concerned with largely 
different problems and involve largely different 
techniques. All are, however, alike in involving 
the study of gene differences. Without the observa- 
tion and analysis of segregations none of these fields 
would be open, although other and non-genetical 
techniques, cytological, anatomical or physiological, 
could perhaps be used. The genetical method of 
investigation breaks down unless suitable genes, 
each comprising two allelomorphs, can be found, and 
their segregations observed. 

Each line of work has commenced from single 
factor segregations and has, in developing, uncovered 
more complex interactions and dependencies of the 
single genes in inheritance. These more complex 
mechanisms, once understood, have provided tools 



2 MEASUREMENT OF LINKAGE IN HEREDITY 

for the analysis of still further and still more recondite 
situations. For example, the detection of dependent 
segregation of two or more genes has led to the 
analysis of the organization of linkage groups. The 
use of linkage as a research tool has in its turn permit- 
ted the development of genetical studies on crossing- 
over and now shows promise of being a powerful 
agent in the analysis of the complexities of heritable 
quantitative differences. 

Thus any piece of genetical research, in being 
based on single factor segregations, requires initially 
a consideration and analysis of the single genes con- 
cerned. It is necessary to isolate and identify the 
genes which are segregating in the material. In 
some cases, e.g. Drosophila melanogaster, Primula 
sinensis, the genes may be well known from past 
work. This is, however, not always the case and 
often the primary analysis essential to the future of 
the research is that of the genes involved. 

There are, at least superficially, two different 
methods of testing the hypothesis that a given 
distinction between two types is controlled by a 
single factor. The first is to show that, in a diploid, 
only three genotypes exist for this factor, that two 
of them are pure breeding and that only one shows 
segregation at gametogenesis for the difference 
between the two postulated allelomorphs. Segre- 
gation is the occurrence of two kinds of gametes 
distinguishable by their capacity for producing 
distinct genetical types (e.g. the production of AA 
as opposed to Aa or of Aa as opposed to aa) when 
mated with any given single type of gamete. The 
differences observed in the progeny of such an indivi- 
dual whose gametes show segregation are also referred 
to as segregations. 

Tests of this first type are perhaps the most con- 
vincing, but are sometimes not immediately available, 
and, in any case, are often preceded as evidence by 
the second type of data, viz. that of the numerical
relations of the classes in a segregating progeny. In
the case of a single factor there exists but one type, 
the heterozygous or Aa type, capable of producing 
two different (A and a) gametes. Furthermore, the 
hypothesis states that it will produce such gametes 
in equal numbers. Then on backcrossing such a 
heterozygote to a pure recessive, i.e. Aa x aa, a 1 : 1 
segregation for types Aa : aa should be obtained. 
On intercrossing two heterozygotes an F2 ratio of
3 : 1 for AA + Aa : aa, or 1 : 2 : 1 for AA : Aa : aa
in the absence of dominance, should be found. These 
are the only segregations possible for an uncomplicated 
single factor difference. We can now ask the question,
‘Are the observed ratios in agreement with
these expectations?’

If the segregations are tested by tetrad analysis, 
i.e. the testing of all four gametes which are the 
products of any meiotic division, such as is possible 
in some lower plants, exact 1 : 1 gametic segregation 
should be observed. This is, however, not always 
the case in the progeny of heterozygotes, as normally 
obtained. Chance deviations from expectation can 
occur. Then the use of this method of testing the 
single factor hypothesis automatically involves con- 
sideration of chance deviations in the segrega- 
tions. 

Other hypotheses, e.g. of two complementary 
factors or of a single gene with one homozygous 
(AA or aa) form lethal, can be tested in both of 
these ways. It can be shown that more than two 
pure breeding types exist in the one case and that 
but one such type exists in the other. It can also 
be shown that an F2 ratio of 9 : 7 and a backcross
ratio of 1 : 3 can be obtained for two complementary
factors and that the lethal single gene gives 2 : 1
and 1 : 1 segregations in F2 and backcross respectively.
Precisely the same principles are involved as in 
the testing of the hypothesis of a simple single 
gene. 
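
The ratios quoted above can be checked by direct enumeration. The following is a minimal sketch in Python and is not part of the original text; it assumes that the two complementary factors segregate independently, that the character appears only when a dominant allelomorph of each gene is present, and that AA is the lethal homozygote. All names are illustrative.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def f2_genotypes():
    """Genotype frequencies from Aa x Aa, with A and a gametes in equal numbers."""
    return {"AA": HALF * HALF, "Aa": 2 * HALF * HALF, "aa": HALF * HALF}

def backcross_genotypes():
    """Genotype frequencies from the backcross Aa x aa."""
    return {"Aa": HALF, "aa": HALF}

f2 = f2_genotypes()
bc = backcross_genotypes()

# Simple single factor with dominance: AA + Aa against aa.
single_f2 = (f2["AA"] + f2["Aa"], f2["aa"])

# Two complementary factors (assumed unlinked): both dominants must be present.
dominant_f2 = f2["AA"] + f2["Aa"]
comp_f2 = (dominant_f2 ** 2, 1 - dominant_f2 ** 2)
dominant_bc = bc["Aa"]
comp_bc = (dominant_bc ** 2, 1 - dominant_bc ** 2)

# Single gene with one homozygote (assumed here to be AA) lethal:
# drop the lethal class and rescale the survivors to sum to one.
survivors = {g: f for g, f in f2.items() if g != "AA"}
total = sum(survivors.values())
lethal_f2 = (survivors["Aa"] / total, survivors["aa"] / total)
lethal_bc = (bc["Aa"], bc["aa"])  # no AA arises in Aa x aa

print(single_f2, comp_f2, comp_bc, lethal_f2, lethal_bc)
```

Exact fractions are used so that the results can be read off directly as 3 : 1, 9 : 7, 1 : 3, 2 : 1 and 1 : 1.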

The other situation often met with and also often
important in genetical analysis is that of linkage.
In some families it may be necessary to distinguish
linkage and factor interaction, or it may be necessary 
to have exact knowledge of the linkage relations of 
two genes prior to their use in the analysis of quanti- 
tative characters, or the information may be necessary 
for a number of other reasons. 

Now, linkage can only be observed in families 
segregating for each of the two characters observed, 
and, furthermore, only if at least one parent is doubly 
heterozygous. For example, a cross of the type 
Aabb X aaBb gives no information about linkage. 
If the two genes are not linked the gametic segregation
of the double heterozygote will be 1/4 AB, 1/4 Ab,
1/4 aB, 1/4 ab, and this may be realized in tetrad
analysis. Among the progeny of a cross it may not 
be found exactly, because chance deviations will 
again occur. 

If linkage exists between the two genes, the gametic
output of the double heterozygote is $\tfrac12(1 - p)$ AB,
$\tfrac12 p$ Ab, $\tfrac12 p$ aB, $\tfrac12(1 - p)$ ab, or $\tfrac12 p$ AB, $\tfrac12(1 - p)$
Ab, $\tfrac12(1 - p)$ aB, $\tfrac12 p$ ab, where p is the recombination
fraction and has the value 0.5 when there is no
linkage. If after considering chance deviations and
other complications p is demonstrably different from
0.5 the evidence for linkage is clear.
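
The two gametic outputs can be written as a single function of p. This is an illustrative sketch only; the function name and the flag `ab_parental` (true when AB and ab are the non-recombinant types) are assumptions introduced here, not notation from the text.

```python
from fractions import Fraction

def gametic_output(p, ab_parental=True):
    """Expected gamete frequencies of a double heterozygote.

    p is the recombination fraction. If ab_parental is True, AB and ab
    are the parental (non-recombinant) gametes; otherwise Ab and aB are.
    """
    p = Fraction(p)
    parental = (1 - p) / 2      # each parental type: (1 - p)/2
    recombinant = p / 2         # each recombinant type: p/2
    if ab_parental:
        return {"AB": parental, "Ab": recombinant,
                "aB": recombinant, "ab": parental}
    return {"AB": recombinant, "Ab": parental,
            "aB": parental, "ab": recombinant}

# With p = 1/2 the four gametes are equally frequent, as for unlinked genes.
print(gametic_output(Fraction(1, 2)))
# With p = 1/10 the parental types predominate.
print(gametic_output(Fraction(1, 10)))
```

Whatever the value of p, the four frequencies sum to one, and at p = 0.5 both arrangements give the same output.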

There is another test for the presence of linkage. 
In a double heterozygote two relations in arrange- 
ment may exist between the genes. Either A — B 
and a — b may be on the same chromosomes or 
A — b and a — B may be the arrangement found. 
In the former case A — b and a — B are the recombination
types. In the latter they are A — B
and a — b. If two different arrangements, bearing 
these relations, can be shown to exist, linkage is 
demonstrated. These arrangements are differentiated
under the names of coupling and repulsion, the
arrangement to which a particular name is allocated
being dependent on conventions varying with different
organisms and different schools of research.
Usually, however, coupling is the AB/ab type and
repulsion the Ab/aB type.

When linkage is once detected, by its causing 
characteristic deviations from the 1 : 1 : 1 : 1 expected 
in the double backcross (AaBb x aabb) or the 
3 : 1 : 3 : 1 expected from the single backcross 
(AaBb x aaBb or AaBb x Aabb) or the 9 : 3 : 3 : 1 
of the F2 (AaBb x AaBb), it is usually necessary
to measure it. The common measure, developed 
from the chromosome theory of heredity, is by the 
calculation of the recombination fraction or value, 
denoted above by p. The value is a measure, though 
not always a simple one, of the frequency of crossing- 
over between the two chromosomes in the region 
delimited by the two genes under consideration. 

It is in the consideration of the chance variations 
from expectation, such as has been shown to occur 
at almost every stage of the genetical mechanism, 
that statistical methods are necessary. 

2. STATISTICAL 

It is clear that the 3 : 1 and 1 : 1 segregations of 
single factor differences will seldom be realized 
exactly, because the individuals of the progeny 
represent samples from a large population of gametes, 
half of which carry one allelomorph and half the 
other. Similarly, recombination values of 50 per 
cent will not be exactly realized even though the 
genes are carried by different chromosomes, because, 
again, the progeny represents a series of samples 
from a population of gametes of which half are 
recombinations. The differentiation of such chance 
fluctuations from real deviations requires a test of 
significance of the observed departure from the 
expected ratio. 




The principle underlying such a test of significance 
is simple but must be grasped clearly. The results 
observed are compared with those expected on the 
basis of the hypothesis under consideration. The 
probability of obtaining by chance a departure from 
expectation at least as large as that found is cal- 
culated, and if this probability is sufficiently small it 
is concluded that the departure is significant. What 
constitutes ‘sufficiently small’ is dependent on
circumstances. If a single family segregates in such 
a way that its departure from expectation would be 
equalled or exceeded by chance in but one trial out 
of twenty, it is usually considered to be showing a 
significant deviation from the hypothesis. But if 
one family out of twenty was showing such a deviation 
it could not be considered as indicating significant 
deviation, because one family out of twenty is 
expected to do so by chance. The second case 
differs from the first in that we have had twenty
trials before finding a deviation of this magnitude. 
In such a case it is expected. In the first case of 
only one family it would be a relatively remote 
contingency. The test of significance must be 
capable of dealing with such contingencies as these.
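
The contrast between one family and twenty can be put in figures. The sketch below is an illustration only and assumes that the twenty families are independent; the function name is hypothetical.

```python
def chance_of_at_least_one(level, trials):
    """Chance that at least one of `trials` independent families shows a
    deviation reaching the given probability level by chance alone."""
    return 1 - (1 - level) ** trials

# Number of families, out of twenty, expected to reach the 1 in 20 level.
expected_number = 0.05 * 20

# Chance that one or more of the twenty does so.
p_any = chance_of_at_least_one(0.05, 20)

print(expected_number, round(p_any, 3))
```

On these assumptions a deviation at the one in twenty level is more likely than not to turn up somewhere among twenty families, which is why it cannot be treated as significant there.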

An hypothesis can never be proved or disproved 
by a test of significance. If the data do not show 
a significant deviation from expectation they agree 
with the hypothesis, but they may also agree with 
several other hypotheses giving closely similar ex- 
pectations. The simplest or most relevant hypo- 
thesis is considered and is not discarded if the 
data agree with it, irrespective of how many more 
complicated hypotheses are also in agreement with 
observation. 

If the data show a high deviation from the expected 
segregation they do not generally disprove the 
hypothesis ; they only make it a more or less unlikely 
one. In the case considered above, when only one 
family was grown, a deviation which would be
exceeded or equalled once in twenty trials was found.
The hypothesis is then rendered unlikely as it could 
account for such a family only once out of twenty 
times. When twenty families are grown it can 
account for one such family in each trial and is not 
unlikely. The level of probability chosen as indicat- 
ing significant departure from hypothesis is simply 
the level at which the worker is willing to be misled. 
If, as is usual, the one in twenty level is taken, he 
will find that his supposedly real departures are 
actually chance ones, once in twenty cases. If the 
one in a hundred level is taken he will be wrong, in 
calling the departure real, less often, but if a hundred 
such cases are taken he must expect to be wrong 
once. If this is constantly borne in mind the experi- 
menter will set his levels of significance to suit his 
circumstances and will not be disconcerted when an 
apparently promising line of work comes to nothing 
because it was based on a false conclusion as a result 
of his test of significance misleading him. 

The only exception to this rule, in genetics, is 
when segregation is observed to occur in what should 
be, by the hypothesis, a homozygote. Even this 
cannot be considered as a complete exception, as 
the ‘ segregation ’ could be the result of mutation, 
or error in the handling of the material. 

When the first and simplest hypothesis has been 
shown to be unlikely, another may be set up and 
tested. The new hypothesis may involve a para- 
meter, a numerical quantity characterising the 
population, which must be estimated, e.g. the sup- 
position that two genes are linked demands that an 
estimate of the recombination value be obtained 
before the hypothesis of linkage is sufficiently precise 
to be tested by observation. This involves the use 
of a method of estimation. When the hypothesis has 
been formulated precisely it may, in its turn, be 
tested against the observational data by a new test 
of significance. 

Both tests of significance and methods of estimation
rest on considerations of frequency distributions.
A frequency distribution gives the relative frequencies 
with which certain events will occur, or individuals 
be found to fall into certain classes. As an example 
let us consider the segregation in a single factor 
backcross. 

Any individual in a family showing a backcross 
segregation is equally likely to have arisen from the 
union of an A gamete or of an a gamete from the 
heterozygous parent, with one of the gametes (all a) 
from the recessive parent. We may then represent 
the frequencies of one individual falling into the 
classes Aa and aa as 1/2 : 1/2. A second individual is
also equally likely to be Aa or aa and its character
will be independent of that of the first one. Then
both will be Aa in 1/2 × 1/2 of cases and both will be
aa in 1/2 × 1/2 of cases. One will be Aa and the other
aa in 2 × 1/2 × 1/2 of cases as this type of family may
occur in either of two ways, viz. the first individual
may be Aa and the second aa or the second Aa
and the first aa. The frequencies with which the
three types of family will occur are thus: both
Aa 1/4, one Aa and one aa 1/2, and both aa 1/4.
A similar argument leads to the conclusion that
families of three individuals will show 3, 2, 1 , and 0 
Aa individuals in 1/8, 3/8, 3/8, and 1/8 of cases 
respectively. 

It will be observed that these frequencies for 
families of one, two and three individuals are given 
by the expansions of the binomial expressions 
$(\tfrac12 + \tfrac12)^1$, $(\tfrac12 + \tfrac12)^2$ and $(\tfrac12 + \tfrac12)^3$ respectively. The
general form for a family of n individuals is $(\tfrac12 + \tfrac12)^n$.
We can calculate the expected frequencies of the 
various types of family of size n by expanding this 
formula. 
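
The expansion is easily carried out mechanically. The following is a minimal sketch, with an illustrative function name, which returns the terms of the binomial for a family of n as exact fractions; the default expectation of 1/2 for the first class corresponds to the backcross discussed above.

```python
from fractions import Fraction
from math import comb

def family_frequencies(n, x=Fraction(1, 2)):
    """Terms of the expansion of (x + y)^n with y = 1 - x.

    Returns a dict mapping r, the number of individuals in the first
    class, to the expected frequency of families with that r,
    listed from r = n down to r = 0.
    """
    y = 1 - x
    return {r: comb(n, r) * x**r * y**(n - r) for r in range(n, -1, -1)}

# Families of three: 3, 2, 1 and 0 Aa individuals.
for r, freq in family_frequencies(3).items():
    print(r, freq)
```

For n = 3 this reproduces the frequencies 1/8, 3/8, 3/8 and 1/8 given in the text.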

Suppose that we have a family of eight individuals 
expected to be segregating in the ratio 1:1. The 
frequencies of families with 8, 7, 6, 5, 4, 3, 2, 1, and
0 Aa individuals will be, from the expansion of
$(\tfrac12 + \tfrac12)^8$:

No. of Aa individuals    8      7      6       5       4       3       2       1      0
Frequency                1/256  8/256  28/256  56/256  70/256  56/256  28/256  8/256  1/256

If we observe that a family actually obtained 
shows 7 Aa and 1 aa members, we can apply a test 
of significance of the deviation from the expected 
4 : 4 by the use of the above frequency distribution. 
The deviation is 3 from expectation in each class. 
Now deviations of 3 or more will occur when families 
of 8 : 0, 7 : 1, 1 : 7, or 0 : 8 are found. These families
are expected with the frequencies 1/256, 8/256, 8/256
and 1/256 respectively. Then the total expectation
of obtaining as large or larger deviation than the one
observed is 18/256, i.e. 9/128 or 0.070.

This is slightly greater than 7 per cent. It is generally 
considered that a deviation should not be taken as 
significant until the probability of obtaining it by 
chance is less than 5 per cent, and so the hypothesis 
can, in this case, be accepted as agreeing sufficiently 
well with the data. 
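
The test just described amounts to summing those terms of the binomial whose deviation from expectation is at least as large as the one observed. A short sketch, under the assumption that deviations in both directions are counted as in the text (the function name is illustrative):

```python
from fractions import Fraction
from math import comb

def exact_tail_probability(observed, n, x=Fraction(1, 2)):
    """Probability of a deviation from the expected number n*x at least
    as large as that observed, in either direction, under (x + y)^n."""
    expected = n * x
    deviation = abs(observed - expected)
    total = Fraction(0)
    for r in range(n + 1):
        if abs(r - expected) >= deviation:
            total += comb(n, r) * x**r * (1 - x)**(n - r)
    return total

# Family of 8 with 7 Aa and 1 aa, expected 4 : 4.
p = exact_tail_probability(7, 8)
print(p, float(p))
```

The result is 9/128, a little over 7 per cent, and so exceeds the 5 per cent level.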

If we had been expecting a segregation of 3 : 1, or 
3/4 : 1/4, the correct binomial for a family of n would
have been $(\tfrac34 + \tfrac14)^n$. The general form for a ratio of
x : y, where y = 1 − x, is $(x + y)^n$ and the general
term of the expansion giving the frequency of families
with r individuals of one class and n − r of the
other is

$$\frac{n!}{r!\,(n - r)!}\, x^r y^{n - r}.$$

While it is always possible to do a test of significance 
in this way, it is not always convenient to calculate 
the binomial expansion, particularly when n is large. 
A quicker technique is needed. Now the probability 
of a deviation being equalled or exceeded by chance 
is expressible as a function of the deviation divided
by a quantity called the standard error. This quantity,
denoted by $\sigma$, is really a measure of the spread
of the frequency distribution and is calculated from
the formula $\sigma = \sqrt{xyn}$.¹ The probability of the
deviation being equalled or exceeded is tabulated
against the deviation : standard-error ratio for ease of
use. It can be calculated for any particular example, 
but is more readily available in tabular form. The 
important point about this method is that for large 
samples the probability corresponding to any given 
ratio of deviation and standard error is constant no 
matter what the mean and standard deviation of the 
distribution may be. The limitation of the method 
is that it assumes a continuous distribution, whereas 
the binomial is really discontinuous. On the other 
hand, as n increases the discontinuity becomes less 
and less important, and may be neglected for quite 
low values of n. Even where discontinuity does 
seriously affect the result it may be corrected quite 
easily, as will be seen later. The standard error 
technique is based on the use of the normal distribu- 
tion which is the limit reached by the binomial dis- 
tribution when n is infinite. This technique has 
been very popular with geneticists in the past ; the 
ratio of deviation to standard error being used under 
the symbol of d/m.

Other quantities calculated from the deviation and 
the expectation can be used in the same way, when 
their relations to the probability have been determined
and tabulated. One in particular, $\chi^2$, is of
great value as it is additive, i.e. the sum of two
independent $\chi^2$ quantities is itself a $\chi^2$ and may be
used as a joint test of significance.

1 It is customary to denote a parameter by a Greek letter
and the corresponding statistic, or estimate of the parameter,
by a corresponding Latin letter. Thus the standard error
of the binomial expansion $(x + y)^n$, where x and y are fixed
by hypothesis, is not an estimate and is denoted by $\sigma$. The
standard error of this expansion if x were estimated would
be itself an estimate of the true standard error $\sigma$ and should
be denoted by s. This convention will be followed with all
the symbols used.

Methods of estimation are also related to the 
binomial expansion. This expansion is itself a 
special case of the multinomial $(m_1 + m_2 + m_3 + \ldots)^n$
whose general term is

$$\frac{n!}{a_1!\, a_2!\, a_3! \ldots}\; m_1^{a_1} m_2^{a_2} m_3^{a_3} \ldots$$

where $m_1 + m_2 + m_3$, &c., is 1 and $a_1 + a_2 + a_3$, &c.,
is n.

The method of maximum likelihood which has the 
property, unique among methods of estimation, of 
always extracting the most precise estimate which 
the data can yield, is based on this multinomial 
expansion. Just as we could express the chance of
finding, in a family of n, r of one type and n − r of
the other, the expectation of any individual falling
in the first class being x, as $\frac{n!}{r!\,(n - r)!}\, x^r y^{n - r}$, so we
can express the chance or likelihood of finding a
family of n individuals with $a_1$ of the first kind, $a_2$ of
the second, and so on, by the general term of the
multinomial above. Then when $m_1$, $m_2$, &c., are
known in terms of the parameter we wish to estimate, 
e.g. when they are the expectations of the four classes 
of a backcross for two linked factors expressed as a 
function of p the recombination value, this term of 
the multinomial is the chance or likelihood of finding 
such a family, expressed in terms of what we want 
to measure. That value of the variable which 
makes this likelihood a maximum, is then found and 
is taken as the best estimate of the parameter. 
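
As an illustration of the process, the sketch below finds the value of p which maximizes the likelihood for a double backcross. The counts are hypothetical and are not taken from the text; AB and ab are assumed to be the non-recombinant classes, and the multinomial coefficient is omitted because it does not involve p. A simple search over a grid of values stands in for the algebraic solution.

```python
from math import log

def log_likelihood(p, counts):
    """Logarithm of the multinomial term, apart from the constant
    coefficient, for backcross classes AB, Ab, aB, ab with expectations
    (1 - p)/2, p/2, p/2, (1 - p)/2."""
    expectations = [(1 - p) / 2, p / 2, p / 2, (1 - p) / 2]
    return sum(a * log(m) for a, m in zip(counts, expectations))

# Hypothetical family: AB, Ab, aB, ab.
counts = (40, 10, 12, 38)

# Search p over 0.001, 0.002, ..., 0.999 for the maximum.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: log_likelihood(p, counts))

# For this simple case the maximum falls at the observed proportion
# of recombinants, which serves as a check on the search.
closed_form = (counts[1] + counts[2]) / sum(counts)

print(p_hat, closed_form)
```

Working with the logarithm of the likelihood leaves the position of the maximum unchanged and avoids very small products.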

Now the estimate of a parameter, or statistic as 
it is termed, derived by some such process, will 
deviate from the true value of the parameter as a 
result of sampling variation, just as families expected 
to give a 1 : 1 ratio deviate from this ratio by sampling 
error. For example, if in a backcross we found 15
of one class and 20 of the other, an estimate of the
frequency of gametes which give rise to the first kind
of individual would be 15/35. This is not 1/2 but it
must be considered as an estimate of 1/2 because the
family is in keeping with the backcross expectation, 
the deviation being ascribable to pure sampling 
error. Thus we require some measure of the confi- 
dence which can be reposed in an estimate of some 
parameter. This is given by the standard error of 
the estimate, or very often by the variance, which 
is the square of the standard error. The variance 
and standard error are measures of the spread of the 
distribution of the estimate round its true value, the 
parameter, and so are measures of the precision with 
which the estimate is made. The method of maxi- 
mum likelihood always has the maximum precision 
possible as measured by this means. 
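
For the backcross of 15 and 20 mentioned above, the estimate and a measure of its precision may be sketched as follows. The standard error is taken here from the binomial formula for a proportion with the estimate substituted for the expectation, which is an assumption of this illustration; the function name is hypothetical.

```python
from math import sqrt

def proportion_with_standard_error(a, n):
    """Estimate of the proportion in one class, with its estimated
    standard error s = sqrt(estimate * (1 - estimate) / n)."""
    estimate = a / n
    s = sqrt(estimate * (1 - estimate) / n)
    return estimate, s

estimate, s = proportion_with_standard_error(15, 35)
variance = s ** 2  # the variance is the square of the standard error

print(round(estimate, 4), round(s, 4), round(variance, 5))
```

The estimate differs from 1/2 by less than its own standard error, in keeping with the statement that the deviation is ascribable to sampling error.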

These principles, outlined above for simple cases, 
are the bases of all methods of analysing genetical 
data. The frequency distribution, or such statistics 
as specify it, of some quantity or quantities are cal- 
culated and are used in the test of the hypothesis. 
Complications may be introduced by complexities or 
shortcomings of the data, and the analysis may need 
to be complicated to accommodate such data, but 
the essentials of the methods remain the same. 

TWO CLASS SEGREGATIONS

3. DEVIATION AND HETEROGENEITY

ANALYSIS of the single factor segregations is the
first consideration not only because of the 
interest which may attach to them themselves, but 
also because the subsequent treatment of the data 
will depend to some extent on the nature of these 
single factor ratios. 

Two questions may be asked about the segregation 
of a single factor : (a) Is it in keeping with some 
expected ratio, e.g. 3 : 1 or 1 : 1? (b) Are all the
families in agreement in showing the same result, 
i.e. are the data homogeneous ? The answers to 
both questions are provided by suitable tests of 
significance. 

4. THE USE OF THE STANDARD ERROR

One of the most popular tests of significance used 
in detecting deviations from expected ratios is that 
based on the standard error, discussed in the previous 
chapter. The standard error of a binomial distribution,
$(x + y)^n$, is given by the formula

$$\sigma_x = \sqrt{\frac{xy}{n}}, \quad \text{where } x + y = 1.$$

This is the standard error appropriate to testing the
agreement of that observed proportion of the family
which falls into one class with its expected value
x or y. If we want to test the agreement of the
actual number of individuals in one class with the
number expected we use the standard error

$$\sigma = \sqrt{xyn}.$$



The procedure, as noted before, is simple. The 
standard error is found, whether for proportion or 
number, and the corresponding deviation from expec- 
tation is found for either the proportion or number 
observed to be in one class. The deviation is divided 
by the standard error and then by the use of a ‘Table
of Normal Deviates’ such as is provided at the end
of this book (Table I), the corresponding probability 
of obtaining as large or larger deviation by chance, 
is obtained. The procedure may be illustrated by 
an example. 

Ex. 1. In a family of Antirrhinum majus, obtained 
by selfing a yellow-flowered plant known to be 
heterozygous, the following segregation for flower 
colour was observed. 

Yellow-flowered plants . . . . 208
Ivory-flowered plants . . . . . 81
Total in family . . . . . . . . 289

Is this in keeping with the 3 : 1 ratio expected from 
selfing a plant heterozygous for a single flower colour 
gene?

When a 3 : 1 ratio is expected the frequencies of 
families of 289 individuals, with 289, 288, &c. yellow 
plants will be given by the expansion of $(\tfrac34 + \tfrac14)^{289}$.

The number expected in each of the two classes will
be 3/4 × 289 and 1/4 × 289 respectively, i.e. 216.75
yellows and 72.25 ivories. Then the deviation of the
number observed in each class from expectation is
216.75 − 208, i.e. 8.75. The standard error of the
number in each class is $\sqrt{\tfrac34 \times \tfrac14 \times 289}$, i.e. 7.36.
Hence the ratio of the deviation to the standard
error is

$$\frac{d}{\sigma} = \frac{8.75}{7.36} = 1.19$$

Reference to Table I shows that such a deviation, 
1.19 times the standard error in size, is expected by
chance about once in four trials, or, in other words,
has a probability of about 0.23. If this difference
were to be considered as indicating real deviation 
from the hypothesis, then a false conclusion would 
have been reached once in every four cases. This 
is too great a proportion of errors, and so the data 
must be considered as agreeing sufficiently well with 
the hypothesis. In general, no probability greater 
than 0.05, i.e. one in twenty, should be considered
as indicating significant deviation. 
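
The arithmetic of Ex. 1 can be reproduced in a few lines. In the sketch below the probability is computed from the normal distribution instead of being read from Table I, deviations in both directions being counted; the function names are illustrative.

```python
from math import sqrt, erfc

def deviation_ratio(observed, n, x):
    """Ratio of the deviation of an observed number from its expectation
    n*x to the standard error sqrt(x * y * n), where y = 1 - x."""
    expected = n * x
    sigma = sqrt(x * (1 - x) * n)
    return abs(observed - expected) / sigma

def two_tailed_probability(ratio):
    """Probability that a normal deviate is at least `ratio` standard
    errors from expectation in either direction."""
    return erfc(ratio / sqrt(2))

# Antirrhinum family: 208 yellow out of 289, expected proportion 3/4.
ratio = deviation_ratio(208, 289, 0.75)
prob = two_tailed_probability(ratio)

print(round(ratio, 2), round(prob, 2))
```

The ratio comes to about 1.19 and the probability to about 0.23, as found from the table.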

The standard error is of use and is easy to apply 
to such cases as the above where only one family is 
concerned, and this family segregates into but two 
classes. The standard error is, however, not easy to 
use for testing for deviations from hypotheses when 
more classes than two are observed, nor is it easily 
adaptable to the testing of agreement among several 
families. For these purposes Pearson’s $\chi^2$ is much
to be preferred. 


5. TESTING DEVIATIONS BY $\chi^2$

$\chi^2$ is calculated from the general formula

$$\chi^2 = S\left[\frac{(a - mn)^2}{mn}\right]$$

where a is the number observed and m the proportion expected
in a class, n is the total and S stands for summation 
over all classes. This quantity is just as simple to 
use for testing deviations in a single family segre- 
gating into but two classes, as is the standard error. 

Ex. 2. We may illustrate its simple use in this 
way, by considering the Antirrhinum data quoted in 
the last example. The setting out of the data and
the calculation of $\chi^2$ testing the deviation from the
expected 3 : 1 is shown in Table 1.

TABLE 1

Class     Number         Number          Deviation    (a − mn)²/mn
          Observed (a)   Expected (mn)   (a − mn)
Yellow    208            216.75          − 8.75       0.353
Ivory      81             72.25          + 8.75       1.056
Total     289            289.00            0.00       1.409


There is one further determination to make before 
$\chi^2$ can be entered in the corresponding table of probabilities,
viz. the number of ‘degrees of freedom’ to
which $\chi^2$ corresponds. The rule for determining the
number of degrees of freedom is simply stated as
‘the number of degrees of freedom is the number of
classes which can be filled arbitrarily’. This and
subsequent examples will amply illustrate the use of 
this rule. In the present case only one class could 
be filled arbitrarily, for once a number were assigned 
to the yellow class the number in the ivory class 
would follow, because it is the total minus the number 
of yellows. We have, then, one degree of freedom. 

The table of probabilities (Table II) given at the
end of this chapter is taken from Fisher (1936a).
The probability of obtaining as large or larger a
deviation is given at the head of the table, and each row
of the table corresponds to a number of degrees of
freedom as shown in the leftmost column. The body
of the table contains the χ² values. To use the table
we note that we have one degree of freedom, and so
must use the first row. Then our value of χ², 1.409,
lies between those values whose probabilities are 0.3
and 0.2. This is in agreement with the standard
error test, as indeed it must be if the tests are both
suitable and both calculated correctly.¹ It is
unnecessary to know the probability with any further
accuracy as it cannot be considered as indicating
a significant deviation unless as low as 0.05.

¹ For one degree of freedom χ² = (deviation / standard error)².
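
For one degree of freedom the probability need not be read from a table, because χ² is then the square of a normal deviate. The sketch below is an illustration under that assumption only. The helper name `prob_one_df` is hypothetical, and it relies on the standard-library function `math.erfc`, which gives the two-tailed normal probability when supplied with the deviate divided by √2.

```python
import math


def prob_one_df(chi2):
    """Probability of a chi-square at least this large, one degree of freedom.

    With one degree of freedom chi-square is the square of a normal
    deviate, so the two-tailed normal probability applies.
    """
    deviate = math.sqrt(chi2)
    return math.erfc(deviate / math.sqrt(2.0))


p = prob_one_df(1.409)  # value found in Table 1
print(0.2 < p < 0.3)
```

The check confirms that the value from Table 1 falls between the tabulated probabilities 0.3 and 0.2. For more than one degree of freedom this shortcut does not apply and the table must be used.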

6. χ² AS A TEST OF HETEROGENEITY

The above simple example fails to bring out the
really valuable property of χ², which is its additive
character. The sum of two χ²s is itself a χ² for a
number of degrees of freedom obtained by adding the
two numbers of degrees of freedom corresponding to
the initial χ²s. It is this additive property which
allows of the easy testing of homogeneity, as
illustrated by the next example.

Ex. 3. Fisher and Mather (1936) give the results
of a backcross for several factors in mice. Table 2
shows the segregations observed for the genes D,d
(intense-dilute coat colour) and Wv,wv (straight-wavy
hair) in the five groups into which this backcross
is divided. The totals of the intense and
dilute, and straight and wavy animals for the whole
backcross are shown at the bottom. In each case
there is a shortage of recessives from the expected
half. Are these shortages of recessives significant
and are the families homogeneous?

TABLE 2

Segregations for the factors D,d and Wv,wv in a mouse backcross

Group          D      d      χ²          Wv     wv     χ²

1             219    211    0.1488      209    221    0.3349
2             174    137    4.4019      169    152    0.9003
3              96     72    3.4286       91     82    0.4682
4              31     28    0.1525       36     23    2.8644
5             128    123    0.0996      134    117    1.1514

Sum of χ²                   8.2314                    5.7192

Total         648    571    4.8638      639    595    1.5689

                 χ²      D.f.   P               χ²      D.f.   P

Deviation        4.864   1      0.05 - 0.02     1.569   1      0.3 - 0.2
Heterogeneity    3.367   4      0.5             4.150   4      0.5 - 0.3

Total            8.231   5                      5.719   5

Let us first consider the Wv,wv segregation.
χ² for the deviation from the expected 1 : 1 segregation
is calculated by the method of the previous example,
for each group separately. For example, the first
group of 430 mice is expected to show 215 Wv and
215 wv mice. It actually has 209 Wv and 221 wv.
Then

\[ \chi^2 = \frac{(209 - 215)^2}{215} + \frac{(221 - 215)^2}{215} = 0.3349 \]

Each group yields a χ² for one degree of freedom.
Hence the sum of these five χ²s, 5.7192, is itself a χ²
for five degrees of freedom. This total χ² may be
considered as comprising two parts, (a) a portion
concerned with the grand deviation of all the groups
taken together from the expected 1 : 1, and (b) a
portion concerned with the disagreement among the
groups when allowance has been made for the grand
deviation. Now the former portion may be calculated
from the totals of straight and wavy mice in
all the groups taken together. It is found to be
1.5689 by precisely the same type of calculation
as for the single groups, and will have one degree of
freedom because it is concerned with a distinction
into two classes. The difference of the total χ² for
five degrees of freedom and this χ² for one degree of
freedom, calculated from the combined segregation,
will be the second or heterogeneity χ² testing the
agreement between the five groups. This heterogeneity
χ² must have 5 − 1, i.e. four degrees of
freedom. Thus we get the analysis of χ² into its
two parts, testing deviation from 1 : 1 and heterogeneity
among groups respectively, as shown in the lower part of
Table 2. Neither χ² when referred to Table II has
a significantly low probability, and so we may say
that the groups agree (a) with the expected 1 : 1
segregation and (b) with one another. The latter
agreement considerably increases confidence in the
value of the former agreement.

The data for D,d are analysed in a precisely similar
manner. In this case, however, the 'deviation' χ²,
calculated from the total segregation, is 4.864 for one
degree of freedom. This χ² is found from Table II
to have a probability of between 0.05 and 0.02. Such
a large deviation could be obtained by chance less
than once in twenty trials, and so should be considered
as indicating a real shortage of dd mice. The
'heterogeneity' χ², obtained as before, by subtraction
is 8.231 − 4.864, i.e. 3.367 for four degrees
of freedom and has a probability of 0.5. The groups
thus agree with one another in showing a shortage
of dd individuals.

The agreement among the groups settles any doubts 
as to the reality of the dd shortage. It removes the 
suspicion that the shortage is due to faulty experi- 
mental technique. 
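
The partition of Ex. 3 into deviation and heterogeneity can be sketched as follows. This is an illustration only: the function names `chi_square_1to1` and `partition` are assumptions made here, and the calculation takes the 1 : 1 expectation for granted throughout, exactly as the text does at this stage.

```python
def chi_square_1to1(a1, a2):
    """Chi-square (one degree of freedom) for deviation from a 1 : 1 ratio."""
    return (a1 - a2) ** 2 / (a1 + a2)


def partition(groups):
    """Split the summed chi-square into deviation and heterogeneity parts.

    groups -- list of (a1, a2) pairs, one per group
    Returns (total, deviation, heterogeneity, heterogeneity d.f.).
    """
    total = sum(chi_square_1to1(a1, a2) for a1, a2 in groups)
    sum1 = sum(a1 for a1, _ in groups)
    sum2 = sum(a2 for _, a2 in groups)
    deviation = chi_square_1to1(sum1, sum2)  # from the pooled segregation
    heterogeneity = total - deviation        # by subtraction
    return total, deviation, heterogeneity, len(groups) - 1


# Table 2 data: (dominant, recessive) counts in the five groups.
d_groups = [(219, 211), (174, 137), (96, 72), (31, 28), (128, 123)]
wv_groups = [(209, 221), (169, 152), (91, 82), (36, 23), (134, 117)]

for name, groups in (("D,d", d_groups), ("Wv,wv", wv_groups)):
    total, dev, het, df = partition(groups)
    print(name, round(total, 3), round(dev, 3), round(het, 3), df)
```

The degrees of freedom follow the additive rule: one per group for the total, one for the pooled deviation, and the remainder for heterogeneity.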

It may be noted here that, in general, hetero- 
geneity, if established, is often a direct result of 
faulty experimentation. It may flow from poor 
classification or partial selection of stronger types by 
overcrowding or insufficient feeding, and from other 
similar causes. With inexplicably heterogeneous
data the whole experiment and its technique are
suspect. With absence of significant heterogeneity,
as in the above example, the validity of the deviations 
is not called into question. 

One further point must be made about the last
example. The total segregation of Dd : dd mice did
not agree with the expected 1 : 1, and yet the
heterogeneity χ² was calculated on the assumption that
observation and expectation did not disagree. Hence
the heterogeneity χ² obtained by subtraction is not
absolutely accurate. In actual practice, however,
it needs a considerably greater deviation of
total segregation from expectation seriously to
invalidate a heterogeneity χ² calculated in this way.
In this example the true value, calculated from the
observed segregation of 648 Dd : 571 dd, is 3.381,
whereas the value obtained by assuming a 1 : 1
segregation was 3.367. This difference would cause
no serious misjudgement. Where, however, the deviation
of the total segregation is more serious, a more
trustworthy method of calculating the χ² for
heterogeneity must be used, in order to avoid the liability
of serious misjudgement. Such a case is also
provided by Fisher and Mather's mouse backcross.
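
The 'true' heterogeneity value quoted above can be reproduced by taking the expected numbers in each group in the proportions of the observed total segregation instead of 1 : 1. The sketch below is an assumption about how such a check might be laid out, not the author's own working, and the function name `heterogeneity_fitted` is hypothetical.

```python
def heterogeneity_fitted(groups):
    """Heterogeneity chi-square using the observed total segregation.

    Expected numbers in each group are taken in the proportions of the
    pooled totals rather than 1 : 1, so one degree of freedom is used up
    in fitting; k groups leave k - 1 degrees of freedom.
    """
    sum1 = sum(a1 for a1, _ in groups)
    sum2 = sum(a2 for _, a2 in groups)
    p1 = sum1 / (sum1 + sum2)  # fitted proportion of the first class
    chi2 = 0.0
    for a1, a2 in groups:
        n = a1 + a2
        e1, e2 = n * p1, n * (1.0 - p1)
        chi2 += (a1 - e1) ** 2 / e1 + (a2 - e2) ** 2 / e2
    return chi2, len(groups) - 1


# D,d counts from Table 2.
d_groups = [(219, 211), (174, 137), (96, 72), (31, 28), (128, 123)]
chi2_het, df = heterogeneity_fitted(d_groups)
print(round(chi2_het, 3), df)
```

Computed in this way the heterogeneity χ² for D,d comes to about 3.38 for four degrees of freedom, in line with the value given in the text.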

7. HETEROGENEITY WHEN SIGNIFICANT TOTAL 
DEVIATIONS ARE PRESENT 

Ex. 4. In Table 3 are set out the details of the
segregation for T,t (dark head-light head) in
Fisher and Mather's mouse backcross. It will be
seen that the 1,013 mice are divisible into five groups 
according to the type of the male parent. These 
types were distinguished, before the commencement 
of the backcross, by their origin, and the grouping 
is in no wise dependent on the breeding results. 
Two of the male types comprise but one individual 
each, but the other three groups each contain several 
individuals. We have thus a hierarchical classifica- 
tion, the whole experiment being divisible into five 
major groups and each major group further sub- 
divisible into smaller subgroups. 

TABLE 3

Segregation for the factor T,t in a mouse backcross

Individual Males      Classes of Male      Total
  a₂      n             a₂      n          a₂ₜ     nₜ

  60     128
  38      96
   6      21
   4      16           155     359
  27      55
  10      17
  10      26

  37      92
  18      45
  10      33            90     233
  11      21
  14      42

  49     122            49     122         427    1013

  27      59            27      59

  37      80
  19      40
  20      49
   3       6           106     240
   3      12
  12      32
  12      21

There are twenty-one such subgroups and on calculating
a χ² for the T,t segregation in each, and
summing the results we should have a total χ² for
twenty-one degrees of freedom. A χ² for five degrees
of freedom can also be calculated from the segregation
totals of the male type groups. There are five
types and each will give a χ² for one degree of freedom,
hence the total obtained by summing the male
type χ²s will have five degrees of freedom. Finally
a χ² for one degree of freedom can be calculated
from the grand total segregation and will serve to
detect deviation from the expected 1 : 1 ratio. Now
the heterogeneity between the five male type segregations
could be obtained, on the assumption of a 1 : 1
segregation, by subtracting from the summed χ² of
the group segregations, the χ² for the total deviation.
This operation would be precisely the same as in
the last example. The χ² for heterogeneity between
male types would then have four degrees of freedom.
Similarly a χ² for heterogeneity between individual
males, but corrected for heterogeneity between
groups, would be found by summing the twenty-one
individual male χ²s and subtracting from the resulting
total the summed χ² obtained from the five group
segregations. The analysis would thus be :

                                                D.f.
Deviation from 1 : 1                              1
Heterogeneity  Between male types                 4
               Between individual males          16

Total                                            21

In Table 3 are given, for each individual male,
each male group and the whole experiment, the total
mice raised (n) and the number of t mice (a₂). There
are in all 427 tt mice out of a total of 1,013. This
is very much less than the expected 506.5, the deviation
being significant as measured by χ². Thus in
computing the heterogeneity χ² a 1 : 1 ratio must
not be assumed. We must take the observed total
segregation of 586 : 427 and calculate the heterogeneity
χ² on this basis. We shall thus reduce the
deviation χ² to zero or in other words shall 'lose'
one degree of freedom by calculating the best fitting
total segregation, i.e. fitting the parameter x in
(x + y)ⁿ. This principle of losing a degree of freedom
on fitting a parameter will be used extensively later.

The calculation of χ² could be done as before.
The expected numbers of T and t mice in each family
or group would be given by (586/1013)n T and (427/1013)n t,
where n is the family or group total, instead of ½n
and ½n if the 1 : 1 were to be assumed. This is,
however, a laborious method and an easier one
developed by Brandt and Snedecor (cf. Fisher, 1936a)
can be used.

For each individual male's family we calculate the
quantity (a₂)²/n, where a₂ is the number of t mice and
n the family total. The same is done for the male
type groups and also for the whole experiment.
The (a₂)²/n values are proportional to χ² for that
family. These (a₂)²/n values are entered in Table 4 in
the same arrangement as the data of Table 3. In
Table 4 we have three columns of values (the rightmost
having but one entry) and each column is
summed. The totals are proportional to the χ²s
corresponding to (left) twenty-one individual males,
(middle) the five types of males, and (right) whole
experiment, i.e. deviation. Then by taking the difference
between the male group total and the whole
experiment value (middle and rightmost columns) we
obtain a quantity proportional to χ² for heterogeneity
between the five groups. Similarly the difference
between the leftmost and middle totals is proportional
to χ² for heterogeneity between males of the
same group. These differences are converted into
χ²s by multiplying by (nₜ)²/(a₁ₜ a₂ₜ), where a₁ₜ is the number
of T mice, a₂ₜ the number of t mice and nₜ (= a₁ₜ + a₂ₜ)
the total in the whole experiment. (This is the adjustment
for the bad T : t segregation. (nₜ)²/(a₁ₜ a₂ₜ) would be
2²/(1 × 1) if a 1 : 1 ratio were to be assumed.)

TABLE 4

Values of (a₂)²/n derived from Table 3

                     Individual Males    Classes of Male    Total

                         28.1250
                         15.0417
                          1.7143
                          1.0000             66.9220
                         13.2545
                          5.8824
                          3.8461

                         14.8804
                          7.2000
                          3.0303             34.7639
                          5.7619
                          4.6667

                         19.6803             19.6803        179.9891

                         12.3559             12.3559

                         17.1125
                          9.0250
                          8.1633
                          1.5000             46.8167
                          0.7500
                          4.5000
                          6.8571

Totals                  184.3474            180.5388        179.9891

Differences                       3.8086              0.5497
χ²                               15.619               2.254
Degrees of freedom       16 (i.e. 21 − 5)      4 (i.e. 5 − 1)
Probability               0.5 - 0.3             0.7 - 0.5

This multiplier is 1013²/(586 × 427), i.e. 4.10103. Then
the male type heterogeneity is 0.5497 × 4.10103,
i.e. 2.254 for four degrees of freedom and is not
significant. The individual male heterogeneity is
3.8086 × 4.10103, i.e. 15.619 for sixteen degrees of
freedom and is not significant. Thus there is no
heterogeneity and all the families agree in showing
a serious shortage of tt mice.
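
The working of Table 4 lends itself to a short computation. The sketch below is illustrative only: the variable names and the nested-list layout of the data are assumptions made here, while the counts are those of Table 3, grouped by type of male. It forms the three column totals of (a₂)²/n, takes their differences, and applies the multiplier (nₜ)²/(a₁ₜ a₂ₜ).

```python
# Table 3 data: (t mice, family total) for each male, grouped by male type.
male_types = [
    [(60, 128), (38, 96), (6, 21), (4, 16), (27, 55), (10, 17), (10, 26)],
    [(37, 92), (18, 45), (10, 33), (11, 21), (14, 42)],
    [(49, 122)],
    [(27, 59)],
    [(37, 80), (19, 40), (20, 49), (3, 6), (3, 12), (12, 32), (12, 21)],
]


def sq_over_n(a2, n):
    """The quantity (a2)^2 / n entered in Table 4."""
    return a2 ** 2 / n


# Left column: one value per individual male.
left = sum(sq_over_n(a2, n) for group in male_types for a2, n in group)

# Middle column: one value per male type, from the group totals.
group_totals = [(sum(a2 for a2, _ in g), sum(n for _, n in g))
                for g in male_types]
middle = sum(sq_over_n(a2, n) for a2, n in group_totals)

# Right column: the whole experiment.
a2t = sum(a2 for a2, _ in group_totals)
nt = sum(n for _, n in group_totals)
a1t = nt - a2t
right = sq_over_n(a2t, nt)

multiplier = nt ** 2 / (a1t * a2t)          # adjustment for the observed T : t ratio
chi2_types = (middle - right) * multiplier  # between male types
chi2_males = (left - middle) * multiplier   # between males within types

n_males = sum(len(g) for g in male_types)
df_types = len(male_types) - 1
df_males = n_males - len(male_types)
print(round(multiplier, 5), round(chi2_types, 3), df_types,
      round(chi2_males, 3), df_males)
```

Because only sums of squares divided by totals are needed, no expected numbers have to be worked out for the individual families, which is the labour the Brandt and Snedecor method saves.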

If there had been heterogeneity between types of
male it would have been necessary to consider the
individuals of each group separately from those of
other groups. Each group would then have had its
own multiplier based on the group values for a₁, a₂,
and n. An example illustrating this procedure will
be found in Fisher (1936a).


8. THE CALCULATION OF χ²


In the examples worked in this chapter various
formulae for calculating χ² have been used. There are
a number of others, each suited for particular purposes,
some of which may be conveniently noted
here. Others will be given in later sections. The
fundamental formula applicable to all cases is

\[ \chi^2 = S\left[\frac{(a - mn)^2}{mn}\right] \]

where a and mn are the observed
and expected numbers in any class and S stands for
summation over all classes. This formula may be
given in a slightly different and more useful form

\[ \chi^2 = S\left(\frac{a^2}{mn}\right) - n \]

where a and m are as before and n is the total number
of individuals in the data. This is identical with the
previous formula and is also universally applicable.

For a family segregating into two classes, whose
expected ratio is l : 1, and the observed numbers
are a₁ : a₂

\[ \chi^2 = \frac{(a_1 - l a_2)^2}{ln} \]

The special cases of this formula for the more usual
genetical expectations are :

Ratio       Formula

1 : 1       χ² = (a₁ − a₂)² / n
3 : 1       χ² = (a₁ − 3a₂)² / 3n
15 : 1      χ² = (a₁ − 15a₂)² / 15n
1 : 3       χ² = (3a₁ − a₂)² / 3n
9 : 7       χ² = (7a₁ − 9a₂)² / 63n

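The Brandt and Snedecor formula for the calculation
of heterogeneity χ² from a hierarchical table,
like Table 3, is

\[ \chi^2 = \frac{n_t^2}{a_{1t}\,a_{2t}}\left[S\left(\frac{a_1^2}{n}\right) - \frac{a_{1t}^2}{n_t}\right] \]

or

\[ \chi^2 = \frac{n_t^2}{a_{1t}\,a_{2t}}\left[S\left(\frac{a_2^2}{n}\right) - \frac{a_{2t}^2}{n_t}\right] \]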
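
The two-class formula of this section, with its special cases, can be checked against the fundamental formula. The sketch below is illustrative only: the function names are assumptions made here, and the counts used for the 9 : 7 case are invented purely for the comparison.

```python
def chi_square_ratio(a1, a2, l):
    """Chi-square for two classes observed a1 : a2, expected l : 1."""
    n = a1 + a2
    return (a1 - l * a2) ** 2 / (l * n)


def chi_square_general(observed, proportions):
    """Fundamental formula S[(a - mn)^2 / mn], used here as a cross-check."""
    n = sum(observed)
    return sum((a - m * n) ** 2 / (m * n)
               for a, m in zip(observed, proportions))


# 3 : 1 expectation, Antirrhinum data of Ex. 2.
print(chi_square_ratio(208, 81, 3))
# 9 : 7 expectation is the case l = 9/7; the counts here are made up.
print(chi_square_ratio(60, 40, 9 / 7),
      chi_square_general([60, 40], [9 / 16, 7 / 16]))
```

Both routes give the same value for any counts, since the short formula is only an algebraic rearrangement of the fundamental one for the case of two classes.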



CHAPTER III 

THE PLANNING OF EXPERIMENTS (I) 

9. FAMILY SIZE

IT is usual in genetical work for the scope of the
experiments to be limited by such considerations 
as available space, labour, &c. It is thus necessary 
to make the best and most profitable use of the 
numbers of individuals that can be raised. The 
achievement of this end usually requires considerable 
care in the planning of the experiments, and statistical 
methods are often of great value in this connexion. 

In many experiments it is desirable to be able to 
pick out certain genotypes, usually homozygotes, 
for the purpose of establishing permanent lines or 
for the detection of some form of factor interaction. 
This involves making test crosses, the homozygote 
being distinguished from heterozygous individuals 
by the failure of segregation in its progeny. These 
progenies may be of very little value except for this 
specific purpose and so should be kept as small as 
is consistent with this end. Now, any progeny 
failing to segregate may still come from a hetero- 
zygous parent, but as the size of the family increases 
the chance of those which fail to segregate having 
come from a heterozygote becomes steadily smaller. 
The minimum size of the progeny designed to test 
some individual is then a statistical question involving 
consideration of the probability that any individual, 
in a family derived from a heterozygote, will be of 
the recessive type, and also of the permissible maximum
probability of obtaining a misleading result, as
decided on by the experimenter.

Let us consider an example. Suppose it is desired 
to test a series of individuals phenotypically dominant 
for one gene, in order to determine the homozygous 
individuals, by using a test cross to a recessive 
individual. The progenies of the homozygotes will 
not show segregation. The progenies of the hetero- 
zygotes are expected to segregate into one-half 
dominants (Aa) and one-half recessives (aa). The 
only error will arise from the failure of the progenies 
of some heterozygotes to contain at least one reces- 
sive. Let it further be decided that such a mis- 
leading result, i.e. failure of segregation in the 
progeny of a heterozygote, must not occur with a 
frequency of more than 1 per cent on the average. 

Now in the progeny of a heterozygote each individual
has a chance of ½ of being a dominant. Then
a family of n individuals will all be dominant in (½)ⁿ
of cases. This is the misleading result and must not
occur in more than 1 per cent of cases. Then the
minimum value of n is given by the solution of the
equation

\[ \left(\tfrac{1}{2}\right)^n = \tfrac{1}{100} \]

Taking logarithms this becomes n log (½) = log (1/100),

\[ \text{i.e.}\quad -0.3010\,n = -2.000 \quad\text{or}\quad n = \frac{2}{0.3010} = 6.6 \]

The minimum size of the progeny must be 7.

If we had been dealing with a case of two factors,
i.e. where the individuals for testing could have been
heterozygous for as many as two genes, a family of n
individuals, n having been chosen for a certain
probability of failing to show segregation of one factor
heterozygous in the parent, would be twice as likely
to fail to show segregation for two factors. Then in
such cases we must increase the stringency of our
test in order to allow for this fact. In the above
example if we had been concerned with two factors
in the test backcrosses we should have equated (½)ⁿ
to 1/200, instead of 1/100, and so obtained 7.6, or 8,
as the minimum size of n in the test.

Another and very similar problem is answered in
the same way. Suppose, for example, we have an
F₂ family segregating for two genes and we want to
breed from a homozygous doubly dominant individual.
Then how many phenotypically AB individuals must
be used in order to include at least one AABB, the
maximum frequency of failure to be 1/500?

Now out of every nine individuals of the phenotype
AB in such an F₂ we expect one to be AABB.
Then the chance of an individual being heterozygous
for at least one factor is 8/9. The chance of all of
n being so heterozygous is (8/9)ⁿ. The maximum
allowable failure is 1/500.

Then

\[ \left(\tfrac{8}{9}\right)^n = \tfrac{1}{500} \]
\[ n \log\left(\tfrac{8}{9}\right) = \log\left(\tfrac{1}{500}\right) \]
\[ -0.0512\,n = -2.6990 \]

and

\[ n = \frac{2.6990}{0.0512} = 52.7 \]

At least fifty-three phenotypically AB individuals
should be used.

At the end of the book will be found a table
(Table III) giving a series of such minimum numbers
of progeny. The leftmost column shows the fraction
of the individuals which are expected to be of the
distinctive type (e.g. it would be ½ in the first and
1/9 in the second of the above examples) and along
the top is the precision of the test. Thus the last
example could be found in the table by looking along
the row corresponding to 1/9 until the column of
0.998 precision, i.e. 0.002 (i.e. 1/500) error, is reached.
The value in the table is then the minimum value of n,
the size of the progeny. As a further example of
the use of this table, consider the question of the
minimum size of family required to obtain at least
one individual of a type which is expected to comprise
¼ of the total, the maximum error to be 1/50. Then
we look along row ¼ and down column 0.980 (i.e.
49/50 precision) and find the value of 13.6 for n.
At least fourteen individuals should be grown.
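
The minimum numbers discussed in this section all come from the same logarithmic calculation, which may be sketched as below. The function name `minimum_family_size` and its argument names are assumptions made for illustration. The exact values it returns may differ in the last figure from those in the text, which were worked with four-figure logarithms.

```python
import math


def minimum_family_size(p_fail, max_error):
    """Smallest n with p_fail ** n <= max_error.

    p_fail    -- chance that a single individual is NOT of the distinctive type
    max_error -- largest acceptable chance of a misleading result
    Returns the exact (fractional) solution and the whole number to raise.
    """
    exact = math.log(max_error) / math.log(p_fail)
    return exact, math.ceil(exact)


print(minimum_family_size(1 / 2, 1 / 100))  # test cross, one factor
print(minimum_family_size(1 / 2, 1 / 200))  # test cross, two factors
print(minimum_family_size(8 / 9, 1 / 500))  # picking out AABB in an F2
print(minimum_family_size(3 / 4, 1 / 50))   # type expected in 1/4 of the family
```

The first argument is the complement of the fraction shown in the leftmost column of Table III, and the second is the complement of the precision shown along its top.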

10. DISTINGUISHING TWO SEGREGATIONS 

A problem superficially related to the foregoing 
but treated differently is that of determining the 
number of individuals necessary in order to decide 
between two different types of segregation, and 
consequently between the two hypotheses on which 
the segregation expectations are based. There must 
be some minimum probability laid down for this 
decision too. 

As an example, suppose we wished to decide
whether one or both of two complementary factors
was segregating in an F₂. If only one were segregating,
the dominant allelomorph of the other being
present in all individuals, a 3 : 1 ratio would be
found. If both were segregating, a 9 : 7 would be
obtained.

Let n be the size of the F₂ necessary for our purpose.
There will be some number (r) of recessives, which 
if occurring in a family of size n will leave both 
hypotheses equally likely. If more than r recessives 
occur, then the 9 : 7 ratio is more likely, and if less 
than r recessives occur, the 3 : 1 is favoured. Hence, 
to solve the problem we make the family sufficiently 
large to ensure that r recessives, if found, will show 
a deviation from expectation on either hypothesis 
of a size that could occur only with that probability 
chosen as the maximum for misclassification. If the 
number of recessives is other than r, as indeed it
must usually be, then one or other hypothesis is
less likely than the maximum misclassification
allowed, and so may be judged to be incorrect.

The actual calculation may be done in either of
two ways, one using the standard error test of
significance and the other the χ² test. Let us
consider the standard error method first.

The standard error of the number of recessives
expected with a 3 : 1 ratio is √(3n/16) and with an
expectation of 9 : 7 is √(63n/256).

Now if we take 0.025 as the maximum allowable
misclassification, we actually utilize the deviate
corresponding to 0.05, because deviation in but one
of the two possible directions is misleading. We
find, from Table I, that the deviation of r from the
expected number of recessives must not be less than
1.959964 times the standard error.

Then, for the 9 : 7 expectation,

\[ \frac{7n}{16} - r = 1.959964\sqrt{\frac{63n}{256}} \]

and for the 3 : 1 expectation

\[ r - \frac{n}{4} = 1.959964\sqrt{\frac{3n}{16}} \]

Then by addition

\[ n\left(\frac{7}{16} - \frac{1}{4}\right) = 1.959964\,\sqrt{n}\left(\sqrt{\frac{63}{256}} + \sqrt{\frac{3}{16}}\right) \]

and

\[ \sqrt{n} = \frac{16}{3} \times 1.959964\left[\frac{7.937254}{16} + \frac{1.732051}{4}\right] = 9.711919 \]
\[ n = 94.32 \]
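
The standard error calculation above can be written out for any pair of expected proportions. This is a sketch under the same assumptions as the text (a single deviate applied to both hypotheses), and the function name `size_by_standard_error` is hypothetical.

```python
import math


def size_by_standard_error(p1, p2, deviate=1.959964):
    """Family size at which the ambiguous count lies `deviate` standard
    errors from the expectation under both hypotheses.

    p1, p2 -- expected proportions of recessives under the two hypotheses
    """
    se1 = math.sqrt(p1 * (1 - p1))  # standard error per sqrt(n), hypothesis 1
    se2 = math.sqrt(p2 * (1 - p2))  # standard error per sqrt(n), hypothesis 2
    root_n = deviate * (se1 + se2) / abs(p2 - p1)
    return root_n ** 2


n = size_by_standard_error(1 / 4, 7 / 16)  # 3 : 1 against 9 : 7
print(round(n, 2), math.ceil(n))
```

The closer the two expected proportions lie to one another, the smaller the divisor and the larger the family needed to separate them.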

The method of χ² should give the same answer if
applied correctly. For this approach it is necessary
to note that we have two expected segregations,
l₁ : 1 and l₂ : 1. Then the observed segregation which
will give equal χ²s on both hypotheses is √(l₁l₂) : 1.
In our case l₁ = 3 and l₂ = 9/7, and so the ambiguous
segregation containing r recessives is 1.9640 : 1 and
r = n/2.9640.

Now taking the ambiguous segregation of 1.9640 : 1, i.e.

\[ \frac{1.9640\,n}{2.9640} : \frac{n}{2.9640} \]

and calculating χ² on the hypothesis of 3 : 1 we find,
using the formula given in Section 8,

\[ \chi^2 = \frac{\left(\dfrac{1.9640\,n}{2.9640} - \dfrac{3n}{2.9640}\right)^2}{3n} = 3.841 \]

for the 0.025 level of deviation probability. We take
χ² = 3.841 for a probability of 0.025 as deviations in
but one of the two possible directions are misleading.

Then

\[ \frac{n^2}{(2.9640)^2} \times \frac{1}{3n}\,[1.9640 - 3]^2 = 3.841 \]

or

\[ n = \frac{3.841 \times 3 \times (2.9640)^2}{(1.9640 - 3)^2} = \frac{101.230}{1.073} = 94.31 \]

This answer differs by about 0.01 per cent from the
previous one, an error within the limits allowed by
calculation.

Thus we must grow ninety-five plants in order to
distinguish between the two hypotheses with a
maximum probability of misclassification of 0.025.
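
The χ² route to the same family size can be sketched in a few lines. The function name `size_by_chi_square` is an assumption made here. The expression it evaluates follows from putting the ambiguous segregation √(l₁l₂) : 1 into the two-class formula of Section 8 and solving for n.

```python
import math


def size_by_chi_square(l1, l2, chi2_level=3.841):
    """Family size at which the ambiguous segregation sqrt(l1*l2) : 1
    gives a chi-square equal to chi2_level on the hypothesis l1 : 1."""
    x = math.sqrt(l1 * l2)  # ambiguous ratio x : 1
    return chi2_level * l1 * (1 + x) ** 2 / (x - l1) ** 2


n = size_by_chi_square(3, 9 / 7)
print(round(n, 2), math.ceil(n))
```

Interchanging the two hypotheses leaves the result unchanged, as it must, since the ambiguous segregation was chosen to give equal χ²s on both.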



CHAPTER IV 


THE DETECTION OF LINKAGE 
11. ANALYSIS OF χ² BY ORTHOGONAL FUNCTIONS

HAVING dealt with the single factor ratios,
attention may now be turned to the detection
of linkage. It will be assumed for the present that
no complications are introduced by aberrant single
gene segregations. The cases where the single
factors are not giving segregations in strict agreement
with Mendelian expectation will be dealt with
later (Chap. VIII).

The method of analysis by χ² is the most profitable
approach to the detection of linkage. The procedure
for the calculation of χ² is essentially similar to that
appropriate to the single factor ratios.

Let us consider a backcross involving two factors. 
One parent is Aa Bb, i.e. doubly heterozygous, and 
the other is the double recessive aa bb. If the two 
factors are each segregating in accordance with the 
Mendelian expectation of 1 : 1 and provided that 
there is no linkage, we expect four classes of offspring, 
AaBb, Aabb, aaBb, aabb, in equal numbers. 
Where m₁ is the expectation for the first class, m₂
for the second and so on,

\[ m_1 = m_2 = m_3 = m_4 = \tfrac{1}{4} \]

Further, let a₁ ... a₄ be the observed frequencies of
the four classes, the total being represented by n.

In the first place it is possible to calculate a χ²
for the joint deviation of all the observed frequencies
from their expectations, by the use of the formula
χ² = S(a²/mn) − n. This χ² has three degrees of freedom,
as there are four classes of which three may be filled
arbitrarily. It must include two components which 
correspond to the deviations of each of the two 
single factor ratios from their expectations, and one 
for the joint segregation from its expectation of no 
linkage. Our task is clearly to separate these com- 
ponents in such a way as to allow of separately 
testing the three possible sources of discrepancy. 
The three degrees of freedom can be conveniently 
subdivided into 

1 for the deviation of the Aa segregation from 1 : 1
1 for the deviation of the Bb segregation from 1 : 1
and 1 detecting association of the two factors in
segregation, our expectation or null hypothesis
being, of course, that they are independent.

The two χ²s corresponding to the first two degrees
of freedom are calculated by the methods given in
the last chapter. In this case

\[ \chi^2_A = \frac{(a_1 + a_2 - a_3 - a_4)^2}{n} \]

and

\[ \chi^2_B = \frac{(a_1 - a_2 + a_3 - a_4)^2}{n} \]

It is easily seen that these reduce to the corresponding
formulae of Section 8.

The formula for the calculation of that χ² value
corresponding to the third, or 'linkage', degree of
freedom follows from the two already employed by
application of the principle of orthogonality and is

\[ \chi^2_L = \frac{(a_1 - a_2 - a_3 + a_4)^2}{n} \]

Any other formula would yield a χ² which would be
based on a comparison itself not independent of the
two comparisons already used. This would clearly
defeat our object of testing the three sources of
deviation separately. The principle of orthogonality
and the arrangement of formulae based on independent
comparisons is dealt with more thoroughly in the
last section of this chapter. Having obtained these
three separate χ² values each with one degree of
freedom it is possible to test the three sources of
discrepancy individually.

Ex. 5. As an example of this type of analysis we
may take the data of Philp (1934) on the joint
segregation of the two factors p and t in the poppy.
In a backcross progeny (Pp Tt × pp tt) he observed the
following classes and frequencies.

TABLE 5

                              PpTt      Pptt      ppTt      pptt      Total

Observed                      191       37        36        203       467
Expected (with no linkage)    116.75    116.75    116.75    116.75    467

A χ² for three degrees of freedom as calculated
from these four classes as they stand has the value
of 221.266. This is clear indication of strong deviation
from the expectation of equal classes. To what is
the deviation due? The next step is to subdivide
χ² into its three components.

First take the deviation of P,p segregation from
the 1 : 1. This component of χ² is found by adding
the first and second classes together and the third
and fourth classes together, taking the difference
and basing the calculation on this. The formula in
the previous notation is

\[ \chi^2_P = \frac{(a_1 + a_2 - a_3 - a_4)^2}{n} \]

χ²(P) is then found to be 0.259.

Similarly χ² for the deviation of the T,t segregation
from 1 : 1 is found to be 0.362.

The third component, that detecting linkage, is
based on the difference between the sums of the first
and fourth classes and the second and third classes.
The formula is, as before,

\[ \chi^2_L = \frac{(a_1 - a_2 - a_3 + a_4)^2}{n} = \frac{(191 - 37 - 36 + 203)^2}{n} \]

and this component proves to be 220.645.

The three components and their probabilities may 
now be tabulated as in Table 6. 

TABLE 6

                          χ²         D.F.   Probability

Segregation for P,p        0.259     1      0.7 - 0.5
Segregation for T,t        0.362     1      0.7 - 0.5
Joint segregation        220.645     1      extremely small

Total                    221.266     3

The total of the three components agrees with
the compound χ² previously calculated by a different
method, so demonstrating that the working is
correct.

It is clear from this partition of χ² that the two
single factor ratios are individually good, but that
there is very strong evidence for the belief that the
factors are not segregating independently of one
another. The two dominant allelomorphs and the
two recessive allelomorphs are associated too often,
or, in other words, there is very strong evidence for
the existence of linkage in the coupling phase.
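
The partition of Ex. 5 can be reproduced with the three orthogonal formulae for the backcross. The sketch below is illustrative only and the function name `backcross_partition` is an assumption. It also recomputes the compound χ² for three degrees of freedom from S(a²/mn) − n, so that the agreement of the two totals can be seen directly.

```python
def backcross_partition(a1, a2, a3, a4):
    """Orthogonal components of chi-square for a two-factor backcross.

    Classes in the order AaBb, Aabb, aaBb, aabb, each expected in 1/4.
    Returns (chi2_A, chi2_B, chi2_linkage), one degree of freedom each.
    """
    n = a1 + a2 + a3 + a4
    chi2_a = (a1 + a2 - a3 - a4) ** 2 / n
    chi2_b = (a1 - a2 + a3 - a4) ** 2 / n
    chi2_l = (a1 - a2 - a3 + a4) ** 2 / n
    return chi2_a, chi2_b, chi2_l


counts = (191, 37, 36, 203)  # poppy data of Table 5
parts = backcross_partition(*counts)
n = sum(counts)
joint = sum(a * a / (n / 4) for a in counts) - n  # S(a^2/mn) - n, 3 d.f.
print([round(c, 3) for c in parts], round(sum(parts), 3), round(joint, 3))
```

The three components add exactly to the compound χ² because the three comparisons are mutually orthogonal.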

This same method of analysis may be applied to 
data obtained from inbreeding doubly heterozygous 
individuals. In this case the four classes are expected 
to occur in the ratio 9 : 3 : 3 : 1 . The formulae for 
the three components in this type of family are 
somewhat different from those used in the case of 
the backcross. The two components corresponding
to the single factor ratios are calculated from the
formulae for the single factor ratios with expectation
3 : 1 (cf. Section 8). The third component then
follows from orthogonality. The three formulae are :

\[ \chi^2_A = \frac{(a_1 + a_2 - 3a_3 - 3a_4)^2}{3n} \qquad \chi^2_B = \frac{(a_1 - 3a_2 + a_3 - 3a_4)^2}{3n} \]

\[ \chi^2_L = \frac{(a_1 - 3a_2 - 3a_3 + 9a_4)^2}{9n} \]

Where a number of families of the same type are
available it is possible to subdivide χ² not only into
the three component parts discussed above but also
into portions depending on deviation of the totals
and on heterogeneity respectively as in the case of
the single factor ratios previously considered. This
will be made clear by an example.

Ex. 6. The joint segregation of the two factors
A,a and B,b in Pharbitis, Morning Glory, has been
studied by Imai (1931). He records the segregations
in three families as shown in Table 7.

TABLE 7

Family      AB      Ab      aB      ab      Total

1            47       8      11       9       75
2            75      14      14      11      114
3            65      13      12      11      101

Total       187      35      37      31      290


First consider the totals of the combined families
as given in the bottom row of the table. The three
components of χ² are calculated from the formulae
shown above. The following analysis is then
obtained :

                          χ²        D.F.   Probability

Segregation for A,a        0.372    1      0.7 - 0.5
Segregation for B,b        0.777    1      0.5 - 0.3
Joint segregation         23.946    1      very small

Total                     25.095    3

It is thus quite clear that again the single factor
ratios account for very little of the total χ² but that
there is a large component corresponding to linkage.
There is undoubtedly evidence for linkage of the two
segregating factors.

This process of analysis may be carried out for
each family separately. In each case there will be
three χ²s each corresponding to one degree of freedom.
This is carried out and tabulated in Table 8.



TABLE 8

$\chi^2$

              Factor pair A,a   Factor pair B,b   Linkage
Family 1           0.111             0.218          7.468
Family 2           0.573             0.573          7.895
Family 3           0.267             0.083          8.714
Total              0.951             0.874         24.077


The bottom row of the table gives the sums, over
all three families, for each component. Each sum
has three degrees of freedom, one being contributed
by each family. The analysis into deviation and
heterogeneity portions is now carried out as in the
examples of Chapter II. The deviation portion has
already been obtained in the previous table. The
difference between this and the corresponding total
from Table 8 is the heterogeneity $\chi^2$. In each case
the deviation $\chi^2$ will have one degree of freedom
and the heterogeneity $\chi^2$ will have 3 - 1, i.e. two
degrees of freedom. This partition is shown in
Table 9.



TABLE 9

                  A,a      B,b      Linkage   D.F.
Deviation         0.372    0.777    23.946    1
Heterogeneity     0.579    0.097     0.131    2
Total             0.951    0.874    24.077    3
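The partition into deviation and heterogeneity can be sketched in a few lines of Python. This is an illustrative sketch only: the helper `f2_components` and the variable names are assumptions, and because Table 9 is built from entries already rounded to three places, the last digit of the heterogeneity values computed here may differ by one unit from the table.

```python
def f2_components(counts):
    """Chi-square components (A,a; B,b; linkage) for classes AB, Ab, aB, ab."""
    a1, a2, a3, a4 = counts
    n = sum(counts)
    return (
        (a1 + a2 - 3 * a3 - 3 * a4) ** 2 / (3 * n),
        (a1 - 3 * a2 + a3 - 3 * a4) ** 2 / (3 * n),
        (a1 - 3 * a2 - 3 * a3 + 9 * a4) ** 2 / (9 * n),
    )


# The three families of Table 7.
families = [(47, 8, 11, 9), (75, 14, 14, 11), (65, 13, 12, 11)]
totals = tuple(sum(col) for col in zip(*families))

per_family = [f2_components(f) for f in families]
summed = [sum(col) for col in zip(*per_family)]   # 3 d.f. per component
deviation = f2_components(totals)                 # 1 d.f. per component
heterogeneity = [s - d for s, d in zip(summed, deviation)]  # 2 d.f.

for name, d, h in zip(("A,a", "B,b", "Linkage"), deviation, heterogeneity):
    print(name, round(d, 3), round(h, 3))
```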


The heterogeneity $\chi^2$s are none of them significant.
We can now add the further statement that the
families are homogeneous for each component. They
agree in showing good single factor ratios and they
also agree in showing linkage of the two factors.
It may be noted that in the case of the linkage
component the heterogeneity $\chi^2$ will be somewhat
too low, as has earlier been shown to be the case
with single factor ratios when a significant deviation
is recorded. The difference between the value found
for this linkage heterogeneity $\chi^2$ and the true value
will, however, be small, and since the value found
above is very small we can assume that the true
value will not reach the level of significance. The
Brandt and Snedecor technique cannot be applied to
finding the true value, as $\chi^2$ is calculated from four
classes weighted in different manners.

Thus the use of the $\chi^2$ test of goodness of fit allows
of analysis of the data which not only detects irregularities
but also shows precisely where the irregularities
occur. The presence of linkage is often obvious
as in the data worked in Ex. 5, but this is not always
the case. Before its presence is assumed, a sensitive
statistical test, as is provided by $\chi^2$, should be
applied. In the single families of Ex. 6 the total $\chi^2$,
which corresponds to three degrees of freedom, shows
a barely significant deviation from expectation
(e.g. Family 1: $\chi^2$ = 7.798, D.F. = 3, Probability
0.05 - 0.02). The advantage of the analysis in
such a case lies in its showing that two of the three
possible sources of deviation, the single factor ratios,
contribute very little to $\chi^2$ whereas the third, linkage,
contributes much. The sensitivity of the test for
linkage is thus very greatly increased.

12. ORTHOGONALITY 

The successful analysis of $\chi^2$ into its components
depends on the choice of functions which give independent
comparisons. That this must be so is clear
when it is remembered that $\chi^2$ is analysed in order
to locate the precise place in which the results fail
to conform to expectation. It would be inefficient,
for the detection of linkage, to calculate some $\chi^2$
value which is not independent of single factor
segregation on the hypothesis of no linkage.

In the two examples worked above the reason
given for the choice of the functions used in calculating
the $\chi^2$ components was that such functions were
orthogonal. The following discussion of orthogonality
will help to make this choice clear.

Consider a segregation into four classes of expectation
$m_1$ to $m_4$ and with observed frequencies of
$a_1$ to $a_4$. Various linear functions of the observed
frequencies may be taken, the general form being

$$x = k_1 a_1 + k_2 a_2 + k_3 a_3 + k_4 a_4.$$

Where $V_x$ is the random sampling variance of the
linear function x, it can be shown that $x^2 / V_x$ is
distributed as a $\chi^2$ for one degree of freedom. If the
coefficients of $a_1$, &c., are chosen correctly the resulting
$\chi^2$ will detect deviations from some specific expectation
of the class frequencies. In the case where
$m_1 = 3m_2 = 3m_3 = 9m_4$, i.e. in an F2 family segregating
for two factors, and the choice of $k_1 = k_2 = 1$
and $k_3 = k_4 = -3$ is made, the resulting $\chi^2$ will
detect deviations of one single factor segregation
from the expected 3 : 1. When the expected value
of x, i.e. $n(k_1 m_1 + k_2 m_2 + k_3 m_3 + k_4 m_4)$, is 0, the
value of $V_x$ is obtained from the formula

$$\frac{1}{n} V_x = S(mk^2)$$

where n is the number of individuals in the family.
If the expected value of x is not 0 this formula for
$V_x$ ceases to hold. So the coefficients should be
chosen to make the expectation of x zero. In the
case mentioned above

$$k_1 m_1 = \tfrac{9}{16}, \quad k_2 m_2 = \tfrac{3}{16}, \quad k_3 m_3 = -\tfrac{9}{16}, \quad k_4 m_4 = -\tfrac{3}{16}$$

and so $S(km) = 0$.

Then

$$\frac{1}{n} V_x = S(mk^2) = \tfrac{1}{16}(9 + 3 + 27 + 9) = 3$$

$$\therefore V_x = 3n \quad \text{and} \quad \chi^2 = \frac{x^2}{3n}$$

Any number of such functions may be chosen, but
they will not all be orthogonal. Orthogonality is
tested by calculating the quantity $S(mkk')$. This
should be zero. If it is not found to be so the two
functions are not based on independent comparisons.

In addition to the function taken above, let us
take a second one, $x'$, where $k'_1 = k'_3 = 1$ and
$k'_2 = k'_4 = -3$. Then

$$S(mk') = \tfrac{1}{16}(9 - 9 + 3 - 3) = 0$$

$$V_{x'} = n\,S(mk'^2) = \tfrac{n}{16}(9 + 27 + 3 + 9) = 3n$$

and $\frac{x'^2}{3n}$ will then be distributed as $\chi^2$.

Furthermore,

$$S(mkk') = \tfrac{1}{16}(9 - 9 - 9 + 9) = 0$$

and so this and the previous function are orthogonal.
In point of fact they are based on the single factor
ratios of the two genes in the segregation. Thus we
have taken two functions each giving a $\chi^2$ of one
degree of freedom. The third component still remains
to be determined. It will be of the nature of what
is termed, in factorial experimentation, an 'interaction',
as it will detect association of the two
primary factors in segregation. The coefficients of
the observed class frequencies must be chosen to
make this third function orthogonal to the other
two. They are easily found by a multiplication
process.

$$k''_1 = k_1 k'_1 = 1 \times 1 = 1$$
$$k''_2 = k_2 k'_2 = 1 \times -3 = -3$$
$$k''_3 = k_3 k'_3 = -3 \times 1 = -3$$
$$k''_4 = k_4 k'_4 = -3 \times -3 = 9$$

Then $x'' = a_1 - 3a_2 - 3a_3 + 9a_4$, with $S(mk'') = 0$
and

$$V_{x''} = n\,S(mk''^2) = \tfrac{n}{16}(9 + 27 + 27 + 81) = 9n$$

$\chi^2$ is given by $\frac{x''^2}{9n}$.

The orthogonality is tested as before and we find

$$S(mkk'') = \tfrac{1}{16}(9 - 9 + 27 - 27) = 0$$
$$S(mk'k'') = \tfrac{1}{16}(9 + 27 - 9 - 27) = 0$$

Thus the analysis of $\chi^2$ in an F2 family is conducted
by calculating the quantities

$$\chi^2_A = \frac{1}{3n}(a_1 + a_2 - 3a_3 - 3a_4)^2$$

$$\chi^2_B = \frac{1}{3n}(a_1 - 3a_2 + a_3 - 3a_4)^2$$

$$\chi^2_L = \frac{1}{9n}(a_1 - 3a_2 - 3a_3 + 9a_4)^2$$

The functions chosen for the calculation of the
first two $\chi^2$s are the same as those developed in the
previous chapter for the detection of deviations
from single factor ratios. The third function, detecting
linkage, then follows from the above considerations.
Other sets of three orthogonal functions could
be chosen but would not have the same meaning for
the analysis as do those adopted above.
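The conditions of zero expectation and orthogonality can be verified mechanically. The sketch below uses exact fractions for the 9 : 3 : 3 : 1 expectations; the helper name `s` (standing for the summation S over classes) is an assumption made for illustration, not a function defined in the text.

```python
from fractions import Fraction


def s(m, *coeff_sets):
    """S(m k k' ...): sum over classes of m times the product of coefficients."""
    total = Fraction(0)
    for i, mi in enumerate(m):
        term = mi
        for k in coeff_sets:
            term *= k[i]
        total += term
    return total


# Expected proportions for the F2 classes AB, Ab, aB, ab.
m = [Fraction(w, 16) for w in (9, 3, 3, 1)]

k_a = (1, 1, -3, -3)                          # segregation of A,a
k_b = (1, -3, 1, -3)                          # segregation of B,b
k_l = tuple(x * y for x, y in zip(k_a, k_b))  # linkage, by multiplication

expectations = [s(m, k) for k in (k_a, k_b, k_l)]      # each should be 0
variance_factors = [s(m, k, k) for k in (k_a, k_b, k_l)]  # V_x / n
cross_products = [s(m, k_a, k_b), s(m, k_a, k_l), s(m, k_b, k_l)]

print(k_l, expectations, variance_factors, cross_products)
```

The variance factors come out as 3, 3 and 9, giving the divisors 3n, 3n and 9n, and all three cross products vanish. The same helper may be applied to any other expected ratio by changing `m` and the coefficient sets.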

The three functions chosen for the analysis of the
backcross data in Ex. 5 can be derived in the same
way as the functions for the F2. Similarly it can
be shown that in the case of two factors one of which
is a member of two segregating duplicate genes, giving
four classes with the expected ratios 45 : 15 : 3 : 1,
the three components of $\chi^2$ are calculated from the
formulae

$$\chi^2_A = \frac{1}{3n}(a_1 - 3a_2 + a_3 - 3a_4)^2$$

$$\chi^2_B = \frac{1}{15n}(a_1 + a_2 - 15a_3 - 15a_4)^2$$

$$\chi^2_L = \frac{1}{45n}(a_1 - 3a_2 - 15a_3 + 45a_4)^2$$

The derivation of these functions, and the demonstration
of their orthogonality, is left as an exercise.

A more complicated example is that of a backcross
for three factors. Eight equal classes are expected,
provided the genes are unlinked and showing good
individual segregation ratios. There are seven degrees
of freedom and so seven $\chi^2$s can be calculated. One
set of seven orthogonal functions, and the most useful set
for genetical analysis, is obtained by giving the
classes coefficients as shown in Table 10. The
variance of each function can be shown to be n.
The $\chi^2$s are then calculated by squaring the quantities
derived from Table 10 and dividing by n in each
case.

The quantities in the table are orthogonal and are 
easily derived by the multiplication method previously 
employed. The first three (1, 2, and 3) are those 
which detect deviations from 1 : 1 in the three single 
factor ratios. They are made up by giving all the 
classes, dominant for the factor under consideration, 
a coefficient of 1 and all classes recessive for the 
factor a coefficient of — 1. The interaction or 
linkage function for factors Aa and Bb (4) is then 
obtained by multiplying the coefficients of the single 
gene ratio functions of these two together (1 and 2). 
Similar linkage functions are obtained for A, a and 
C,c, and B,b and C,c. The seventh and last degree 
of freedom corresponds to a function which has no 
simple genetical meaning but which is necessary to 
complete the analysis. The coefficients of this 
function are obtained by multiplying those of the 
first (A, a) single gene ratio function (1) by those of 
the function corresponding to linkage between B,b 
and C,c (6). It may be also obtained from multipli- 
cation of the second (B,b) function (2) by the linkage 
function of A, a and C,c (5) and finally by the 
corresponding multiplication of the third and fourth 
functions (3 and 4) of the table. This multiplication
method of obtaining the orthogonal functions corresponding
to linkage degrees of freedom may be
employed in any case that may arise.

TABLE 10

Backcross for Three Factors
Coefficients of the Functions for Calculation of $\chi^2$

Genetical                   Function
Class          1     2     3     4     5     6     7
AaBbCc         1     1     1     1     1     1     1
AaBbcc         1     1    -1     1    -1    -1    -1
AabbCc         1    -1     1    -1     1    -1    -1
Aabbcc         1    -1    -1    -1    -1     1     1
aaBbCc        -1     1     1    -1    -1     1    -1
aaBbcc        -1     1    -1    -1     1    -1     1
aabbCc        -1    -1     1     1    -1    -1     1
aabbcc        -1    -1    -1     1     1     1    -1
Total          0     0     0     0     0     0     0

Function 1 gives $\chi^2$ for the segregation of the single factor A,a.
Function 2 gives $\chi^2$ for the segregation of the single factor B,b.
Function 3 gives $\chi^2$ for the segregation of the single factor C,c.
Function 4 gives $\chi^2$ for linkage between A,a and B,b.
Function 5 gives $\chi^2$ for linkage between A,a and C,c.
Function 6 gives $\chi^2$ for linkage between B,b and C,c.
Function 7 completes the analysis.
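The multiplication method lends itself to a short program. The following Python sketch rebuilds the seven sets of coefficients of Table 10 from the three single factor functions and computes the seven $\chi^2$s for a family; the class counts used are hypothetical figures chosen only to exercise the code, and the names `functions` and `chi_squares` are illustrative.

```python
from itertools import product

# Classes in the order of Table 10. Each entry holds +1 (dominant) or
# -1 (recessive) for the factors A, B and C respectively.
classes = list(product((1, -1), repeat=3))

# Functions 1-3 are the single factor ratios; 4-7 follow by multiplication.
functions = {
    1: [a for a, b, c in classes],
    2: [b for a, b, c in classes],
    3: [c for a, b, c in classes],
    4: [a * b for a, b, c in classes],
    5: [a * c for a, b, c in classes],
    6: [b * c for a, b, c in classes],
    7: [a * b * c for a, b, c in classes],
}


def chi_squares(counts):
    """One chi-square per function: the squared linear function divided by n."""
    n = sum(counts)
    return {
        f: sum(k * x for k, x in zip(coeffs, counts)) ** 2 / n
        for f, coeffs in functions.items()
    }


# Hypothetical class counts (not from the text), in the order of Table 10.
counts = [30, 22, 25, 18, 20, 27, 24, 34]
result = chi_squares(counts)
print({f: round(v, 3) for f, v in result.items()})
```

Since the seven functions are orthogonal and each has variance n, the seven values add up to the total $\chi^2$ for seven degrees of freedom calculated on the expectation of eight equal classes.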



CHAPTER V 

THE ESTIMATION OF LINKAGE 


13. CRITERIA OF ESTIMATION

HAVING detected the presence of linkage in a
segregation for two or more genes the next 
step is, naturally, to obtain some measure of the 
intensity of the linkage. It should be noted at this 
point that though detection of linkage involves no 
hypothesis as to the nature of linkage, being merely 
the demonstration that the hypothesis of free segrega- 
tion is not true, the measurement of the intensity 
of linkage involves the calculation of a statistic 
relevant to some hypothesis of its nature. For 
example, the measure of linkage is a different one 
if the chromosome theory is accepted from that 
measure used if one adopts the gametic reduplication 
hypothesis. The chromosome theory, which is 
generally accepted, leads to a measure of the intensity 
of linkage based on the frequency of breakage and 
rejoining of the homologous chromosomes between 
the loci concerned. This is estimated as the propor- 
tion of recombination chromosomes. In the case of 
diploid organisms, this is the same as estimating the 
frequency of recombination gametes. 

Having thus decided on the quantity, or parameter, 
as it is termed, which is to be estimated, the next 
decision to be taken is as to the method of estimation. 
Two criteria must be satisfied in the case of linkage 
estimation. The first, that of consistency, concerns 
the statistic itself. Care must be taken that this is
really an estimate of the parameter concerned and
not of something different. This appears, at first 
sight, to be obvious but it must be remembered that 
different types of data lead to estimates of different 
things. The backcross allows of direct estimation of 
the recombination fraction, whereas F 2 data can at 
best only give estimates of either the square of the 
recombination fraction, or of the square of one 
minus this fraction. Where the recombination frac- 
tion differs on the male and female sides, the product 
of the two fractions, or of one minus each fraction, 
is estimated. To overlook this possibility would be 
most misleading. 

The second criterion concerns the precision of the 
estimate. We must take care to obtain the most 
precise estimate possible, in the sense that the 
estimate should have the smallest variance, or 
standard error, that the data can give. This is the 
criterion of efficiency. The reasons for the applica- 
tion of this criterion will be made clear later. In 
some cases an inefficient estimate may be employed 
if the efficient estimate is difficult to obtain, but this 
is a course which should not generally be followed. 
A third criterion, that of sufficiency (see Fisher, 
1936) is involved in some cases of estimation but can 
be neglected in the case of linkage. Furthermore, 
the method given below for linkage estimation will 
lead to a sufficient estimate, if one exists. 

The satisfaction of the criterion of consistency is 
a matter of choosing the right quantity to estimate. 
This will be illustrated fully by the examples worked 
below. Satisfaction of the criterion of efficiency is 
one which can only be considered mathematically. 
It will be sufficient to say here that the method given 
below has been shown always to lead, in the theory 
of large samples, to an efficient statistic, i.e. an 
estimate having the smallest standard error of which 
the data will allow. 

This method is that of Maximum Likelihood.

The principle of the method is easily grasped. Let p
be the recombination fraction, $m_1 \ldots m_t$ the
expected proportions of individuals in the segregate
classes $1 \ldots t$, and $a_1 \ldots a_t$ the numbers of
individuals observed in these classes. The expected
proportions, denoted by m, are known in terms of
p, the quantity which is to be estimated.

The likelihood of obtaining the observed family is
given by a term of the expansion of

$$(m_1 + m_2 + \ldots + m_t)^n$$

where n is the total individuals in the family (see
Fisher, 1921). The relevant term is

$$\frac{n!}{a_1!\,a_2! \ldots a_t!}\,(m_1)^{a_1}(m_2)^{a_2} \ldots (m_t)^{a_t}$$

The method of maximum likelihood depends on 
the maximization of this expression, with respect to 
p. It is difficult, however, to differentiate such an 
expression and resort is made to a device for this 
purpose. The expression and its logarithm will 
both be maxima at the same value of p. Hence 
we may find the requisite recombination fraction by 
maximizing the logarithm of the likelihood expression 
with respect to p. 

The logarithm of the likelihood expression, denoted
by L, is

$$L = C + a_1 \log m_1 + a_2 \log m_2 + \ldots + a_t \log m_t$$

where C is a constant depending on the coefficient
of the likelihood term. This will vanish on maximization
by differentiation and so may be neglected.

Differentiating and equating to zero leads to the
equation of estimation

$$\frac{dL}{dp} = a_1 \frac{d \log m_1}{dp} + a_2 \frac{d \log m_2}{dp} + \ldots + a_t \frac{d \log m_t}{dp} = 0$$

One of the solutions of this equation will be the 
desired value of p. There is never any doubt as to 
which root is required since all the others lead to 
impossible values for the recombination fraction.

14. THE CALCULATION OF MAXIMUM LIKELIHOOD

Ex. 7. Let us consider the estimation of the
recombination fraction p in the case of the factors
p and t in the poppy (Philp's data). It has been
shown in the previous chapter (Ex. 5) that the data
give evidence of linkage of these factors. The cross
was one of the double heterozygote in the coupling
phase to the double recessive, PT/pt × pt/pt. All the
gametes from the recessive parent will be pt and so
do not enter into our consideration of the problem.
The gametes from the heterozygous parent will be
of four kinds, two of which are old, or original,
combinations and the other two new, or re-, combinations.
In this way the expected frequencies
shown in Table 11 are arrived at. The corresponding
observed numbers are also shown.



TABLE 11

             PT            Pt        pT        pt            Total
Observed     191           36        37        203           467
Expected     (n/2)(1 - p)  (n/2)p    (n/2)p    (n/2)(1 - p)  n

The logarithm likelihood expression is thus

$$L = 191 \log(\tfrac{1}{2} - \tfrac{1}{2}p) + 36 \log(\tfrac{1}{2}p) + 37 \log(\tfrac{1}{2}p) + 203 \log(\tfrac{1}{2} - \tfrac{1}{2}p)$$

and maximizing by differentiation and equating to
zero the equation of estimation becomes:

$$\frac{dL}{dp} = -\frac{191}{1 - p} + \frac{36}{p} + \frac{37}{p} - \frac{203}{1 - p} = 0$$

The solution is $p = \frac{73}{467} = 0.1563$ or 15.63 per cent.
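The estimate may be confirmed numerically. The sketch below evaluates the logarithm of the likelihood for Philp's data over a fine grid of values of p and compares the maximizing value with the root of the equation of estimation. The grid search is a crude device used here for illustration only; it is not the method of the text, and the function and variable names are assumptions.

```python
import math

# Observed backcross classes from Table 11.
observed = {"PT": 191, "Pt": 36, "pT": 37, "pt": 203}


def log_likelihood(p, a=observed):
    """Logarithm of the likelihood, omitting the constant C."""
    parental = a["PT"] + a["pt"]       # old combinations, expectation (1 - p)/2 each
    recombinant = a["Pt"] + a["pT"]    # new combinations, expectation p/2 each
    return parental * math.log((1 - p) / 2) + recombinant * math.log(p / 2)


n = sum(observed.values())
p_hat = (observed["Pt"] + observed["pT"]) / n   # root of the equation of estimation

# Crude check: the grid value with the largest log likelihood.
grid = [i / 10000 for i in range(1, 10000)]
p_grid = max(grid, key=log_likelihood)

print(round(p_hat, 4), p_grid)
```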


It will be noticed that in the case of the backcross
this method of estimation leads to the formula which
is in universal use for data of this kind, viz.

$$p = \frac{a_2 + a_3}{n} \text{ in the case of coupling, or } p = \frac{a_1 + a_4}{n} \text{ in the case of repulsion.}$$

Having obtained our estimate of p, we are next
concerned with its standard error ($s_p$). Where $V_p$ is
the variance, i.e. the standard error squared, of p, it
can be shown that

$$-\frac{1}{V_p} = S\left(mn\,\frac{d^2 \log m}{dp^2}\right)$$

This is easily calculated for the present example.
We already have $\frac{dL}{dp}$. We must differentiate
for a second time and then substitute the expected
for the observed value, i.e. nm for a. This gives

$$-\frac{1}{V_p} = -\frac{n}{2}\left(\frac{1}{1 - p} + \frac{1}{p} + \frac{1}{p} + \frac{1}{1 - p}\right) = -\frac{467}{p(1 - p)}$$

$$V_p = \frac{p(1 - p)}{467}$$

Inserting the estimated value of p we obtain
$V_p = 0.0002824$ and $s_p = \sqrt{V_p} = 0.0168$ or 1.68 per
cent.

The general formula for $s_p$ in the case of a backcross
is $\sqrt{\frac{p(1 - p)}{n}}$, which is the formula in universal
use. It is of some interest that, in this simple case,
the formula given by the method of maximum likelihood
is that previously arrived at by the application
of simpler statistical considerations.
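The backcross estimate and its standard error reduce to two lines of arithmetic, sketched here in Python for the data of Ex. 7. The function name `backcross_estimate` is illustrative only.

```python
import math


def backcross_estimate(recombinants, n):
    """Recombination fraction, its variance and its standard error.

    Uses p = recombinants / n and V_p = p(1 - p) / n, the large-sample
    variance given by the method of maximum likelihood for a backcross.
    """
    p = recombinants / n
    variance = p * (1 - p) / n
    return p, variance, math.sqrt(variance)


# Philp's poppy data: 36 + 37 recombinants in a family of 467.
p, v_p, s_p = backcross_estimate(36 + 37, 467)
print(round(p, 4), round(v_p, 7), round(s_p, 4))
```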

Ex. 8. As a further example of estimation, we
may consider the data of Imai concerning the genes
A,a and B,b in Pharbitis. These were shown earlier
(Ex. 6) to show strong evidence of linkage in the
coupling phase. These data are from F2 families.
The first thing is to ascertain the expectations of the
four observed classes in terms of p. The gametic
series of expectations will be the same as the series
expected in the case of the backcross. But we have
no reason to assume that the recombination fraction
is the same in both male and female gametogenesis.
We may represent the former by $p_1$ and the latter
by $p_2$. The gametic series of expectations are then:

TABLE 12

                   AB             Ab          aB          ab
Male gametes       (1/2)(1 - p1)  (1/2)p1     (1/2)p1     (1/2)(1 - p1)
Female gametes     (1/2)(1 - p2)  (1/2)p2     (1/2)p2     (1/2)(1 - p2)

From this table it is possible to build up the expectations
of the four phenotypically distinct F2 classes.
The double recessive class can only result from
mating of doubly recessive male and female gametes
and so will be expected in $\tfrac{1}{4}(1 - p_1)(1 - p_2)$ of
cases. The total incidence of a plants is $\tfrac{1}{4}$ and
so the singly recessive class, aB, will occur in
$\tfrac{1}{4}(1 - (1 - p_1)(1 - p_2))$. The other singly dominant
class will have equal expectation and the doubly
dominant class, AB, must be $\tfrac{1}{4}(2 + (1 - p_1)(1 - p_2))$.
Thus all the expectations are dependent on the
quantity $(1 - p_1)(1 - p_2)$. This is the parameter
which can be estimated. If we care to assume that
$p_1 = p_2$ it is possible to obtain an estimate of p, but
only if this assumption is made.

Let us write P for $(1 - p_1)(1 - p_2)$. Then the
expectations of the four classes are as shown in
Table 13. The observed frequencies of the four
classes are also shown in that table.

TABLE 13

Class          AB            Ab            aB            ab
Expectation    (n/4)(2 + P)  (n/4)(1 - P)  (n/4)(1 - P)  (n/4)P
Observed       187           35            37            31

The logarithm likelihood expression is then:

$$L = 187 \log(\tfrac{1}{2} + \tfrac{1}{4}P) + 35 \log(\tfrac{1}{4} - \tfrac{1}{4}P) + 37 \log(\tfrac{1}{4} - \tfrac{1}{4}P) + 31 \log \tfrac{1}{4}P$$

Maximization leads to the equation:

$$\frac{dL}{dP} = \frac{187}{2 + P} - \frac{35}{1 - P} - \frac{37}{1 - P} + \frac{31}{P} = 0$$

which reduces to the quadratic

$$62 + 12P - 290P^2 = 0$$

giving P = 0.4835.

The variance of P is to be obtained by the method
given previously. Redifferentiating and substituting
nm for a

$$-\frac{1}{V_P} = -\frac{n}{4}\left(\frac{1}{2 + P} + \frac{1}{1 - P} + \frac{1}{1 - P} + \frac{1}{P}\right)$$

$$V_P = \frac{2P(1 - P)(2 + P)}{n(1 + 2P)}$$

Hence $V_P = 0.002174$ and $s_P = 0.04663$.
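The solution of the quadratic and the evaluation of the variance can be sketched as follows. Clearing fractions in the equation of estimation for any F2 family in coupling gives $nP^2 - (a_1 - 2a_2 - 2a_3 - a_4)P - 2a_4 = 0$, of which the positive root is taken; this general form is a restatement made for the purposes of the sketch, and the function name `f2_mle` is an assumption.

```python
import math


def f2_mle(a1, a2, a3, a4):
    """Maximum likelihood estimate of P, with its variance and standard error.

    Classes are ordered AB, Ab, aB, ab, with expectations
    (2 + P)/4, (1 - P)/4, (1 - P)/4 and P/4.
    """
    n = a1 + a2 + a3 + a4
    b = a1 - 2 * a2 - 2 * a3 - a4
    # Positive root of n P^2 - b P - 2 a4 = 0.
    P = (b + math.sqrt(b * b + 8 * n * a4)) / (2 * n)
    variance = 2 * P * (1 - P) * (2 + P) / (n * (1 + 2 * P))
    return P, variance, math.sqrt(variance)


# Imai's Pharbitis totals (Table 13).
P, v_P, s_P = f2_mle(187, 35, 37, 31)
print(round(P, 4), round(v_P, 6), round(s_P, 5))
```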

If we now care to assume that $p_1 = p_2$ we have

$$(1 - p) = \sqrt{P} = 0.6953$$

p = 0.3047 or 30.47 per cent.

The variance of p is then found from the variance of
P. It can be shown that

$$\frac{1}{V_p} = \frac{1}{V_P}\left(\frac{dP}{dp}\right)^2, \qquad \left(\frac{dP}{dp}\right)^2 = 4(1 - p)^2 = 4P.$$

Hence

$$V_p = \frac{V_P}{4P} = 0.001124$$

and $s_p = \sqrt{V_p} = 0.03353$ or 3.35 per cent.
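The passage from P to p is a simple transformation, sketched below on the assumption that $p_1 = p_2$. The function name `p_from_P` is illustrative, and the inputs are the rounded values quoted in the text.

```python
import math


def p_from_P(P, var_P):
    """Recombination fraction and its variance from P = (1 - p)^2.

    Valid only on the assumption that the recombination fraction is the
    same on the male and female sides.
    """
    p = 1 - math.sqrt(P)
    var_p = var_P / (4 * P)   # since (dP/dp)^2 = 4 (1 - p)^2 = 4 P
    return p, var_p, math.sqrt(var_p)


p, v_p, s_p = p_from_P(0.4835, 0.002174)
print(round(p, 4), round(v_p, 6), round(s_p, 5))
```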

It will be noticed that linkage could be detected
by the calculation of p and its standard error and
then testing the significance of the deviation of p
from the freedom value of 0.5. This method, if
correctly applied, should give the same result as the
use of the $\chi^2$ method. It has, however, the disadvantage
of not leading to such a fine analysis of
the situation, in respect of heterogeneity tests, &c.,
as does the $\chi^2$ test. The latter is to be preferred for
detection.

15. THE EFFICIENCY OF STATISTICS 

It may be wondered at that the method of estimation
bears no relation to the calculation of $\chi^2$ which
is so good for detection. It would superficially seem
reasonable to employ the linear function x from
which $\chi^2$ is calculated as a method of estimating the
recombination fraction. Let us consider the estimation
of P from an F2 family using this method.
Where the four observed classes are $a_1 \ldots a_4$ the
linkage function for the calculation of $\chi^2$ is

$$x = a_1 - 3a_2 - 3a_3 + 9a_4$$

The expected value of this function in terms of P is

$$x = \frac{n}{4}(16P - 4)$$

Then P may be estimated from the equation

$$n(4P - 1) = a_1 - 3a_2 - 3a_3 + 9a_4$$

or

$$P = \frac{1}{4n}(2a_1 - 2a_2 - 2a_3 + 10a_4)$$

We are next concerned with the estimation of the
standard error of P arrived at by this method. The
sampling variance of such a statistic is obtained from
the formula

$$nV_P = S(mk^2) - P^2 \quad \text{(Fisher 1936a)}$$

which is related to the formula used for obtaining
the sampling variance when calculating $\chi^2$. The
chief difference is due to the fact that in this case
S(mk) does not equal 0.

Here

$$m_1 = \tfrac{1}{4}(2 + P), \quad m_2 = m_3 = \tfrac{1}{4}(1 - P), \quad m_4 = \tfrac{1}{4}P$$

and

$$k_1 = \tfrac{1}{2} = -k_2 = -k_3, \quad k_4 = 2\tfrac{1}{2}$$

$$4nV_P = \tfrac{1}{4}(4 + 24P) - 4P^2$$

and so

$$V_P = \frac{1 + 6P - 4P^2}{4n}$$

This is not the same as the variance of the maximum
likelihood statistic. Now the smaller the
variance the more precise the estimate and so the
more efficient the statistic. The maximum likelihood
statistic has the smallest possible variance and so is
always 100 per cent efficient. The efficiency of any
other statistic may conveniently be expressed as the
ratio of the variance of the maximum likelihood
statistic to the variance of the statistic in question.
In the case of Imai's data the efficiency is then

$$\frac{8P(1 - P)(2 + P)}{(1 + 6P - 4P^2)(1 + 2P)}$$

P was found to be 0.4835 and so the efficiency is
0.8505 or 85.05 per cent.
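The efficiency of the linear statistic may be tabulated for any value of P from the two variance formulae. The following sketch does so in Python; the function names are illustrative, and n cancels from the ratio, so it is set to 1 inside `efficiency`.

```python
def var_ml(P, n):
    """Variance of the maximum likelihood estimate of P from an F2."""
    return 2 * P * (1 - P) * (2 + P) / (n * (1 + 2 * P))


def var_linear(P, n):
    """Variance of the estimate of P derived from the linkage function x."""
    return (1 + 6 * P - 4 * P ** 2) / (4 * n)


def efficiency(P):
    """Ratio of the two variances; the family size n cancels."""
    return var_ml(P, 1) / var_linear(P, 1)


for P in (0.25, 0.4835, 0.75, 0.95, 1.0):
    print(P, round(efficiency(P), 4))
```

At P = 0.4835 the ratio is 0.8505, as found above. At P = 0.25 it is unity, and at P = 1 it falls to zero.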

Where the value of P is $\tfrac{1}{4}$, i.e. p is $\tfrac{1}{2}$ in the absence
of linkage, the efficiency of the statistic in question
is 1. It is thus fully efficient for the detection of
linkage, and so the use of $\chi^2$ for this purpose is
justified. Where the linkage value is small the
efficiency of this particular statistic is very low and
will lead to the most misleading results. With no
recombination, p = 0, the efficiency is zero. As an
example of the trouble which the use of inefficient
statistics leads to, we may consider the estimation
of the recombination fraction in some F2 data showing
tight linkage of two factors.

Ex. 9. The data are on the segregation of the two
factors, G,g and L,l in an F2 of Primula sinensis
(De Winton and Haldane, 1936). The factors were
in the coupling phase and the following segregation
was observed:

TABLE 14

             GL            Gl            gL            gl        Total
Observed     977           16            19            360       1,372
Expected     (n/4)(2 + P)  (n/4)(1 - P)  (n/4)(1 - P)  (n/4)P

where $P = (1 - p)^2$.

Estimation by the method of maximum likelihood
as in the previous example leads to the results

P = 0.9507, p = 2.50 per cent assuming $p_1 = p_2$
$V_P = 0.3294 \times 10^{-4}$

Estimation by the use of the linear function related
to $\chi^2$ leads to the values

P = 0.9993, p = 0.04 per cent
$V_P = 0.5469 \times 10^{-3}$

The efficiency of this last estimate is thus
$\frac{0.00003294}{0.0005469}$ or 6.02 per cent.

To express this in another way, an equally precise 
estimate could have been obtained from eighty-three 
plants if the method of maximum likelihood had been 
employed. Nearly 94 per cent of the information 
in the data has been wasted by inefficient estimation. 
Furthermore, it will be seen that the estimate 
obtained is very different from that yielded by the 
method of maximum likelihood. The difference 
between the two estimates is significant. Thus the
second statistic is not only wasteful of the data but
is also clearly wrong.

These deficiencies of the inefficient statistic also
result in another serious difficulty. Having obtained
an estimate of P we can calculate the expectations
of the various classes and apply a $\chi^2$ test to determine
whether the whole of the discrepancy originally
detected in the data is accounted for by the presence
of linkage between the two factors. In the present
example the substitution of the value 0.9507, the
maximum likelihood estimate, for P leads us to
expect the four classes in the frequencies shown in
Table 15, middle row.


TABLE 15

                          GL        Gl       gL       gl        Total
Observed                  977       16       19       360       1,372
Expected (P = 0.9507)     1012.09   16.91    16.91    326.09    1,372
Expected (P = 0.9993)     1028.76    0.24     0.24    342.76    1,372

Calculating $\chi^2$ for the agreement of these expected
values and the observed gives the value $\chi^2$ = 5.050.
The number of degrees of freedom is reduced from
three, the number when testing goodness of fit on
the assumption of no linkage, to two because a
statistic has been calculated and so the number of
classes which can arbitrarily be filled is one less than
before. This value of $\chi^2$ for two degrees of freedom
has a probability of between 0.10 and 0.05, and so
does not indicate any serious deviation from expectation.
Nearly the whole of the original discrepancy,
from the hypothesis of good single factor ratios
and no linkage, is accounted for by the assumption
of linkage, as an analysis of $\chi^2$ would have
suggested.
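The test of goodness of fit after estimation can be sketched as follows. The expectations are formed from the maximum likelihood value P = 0.9507 quoted above; the function names are illustrative, and the value obtained is to be referred to the $\chi^2$ distribution for two degrees of freedom.

```python
def f2_expected(P, n):
    """Expected numbers in the classes GL, Gl, gL, gl of an F2 in coupling."""
    return [n * (2 + P) / 4, n * (1 - P) / 4, n * (1 - P) / 4, n * P / 4]


def chi_square(observed, expected):
    """Sum over classes of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [977, 16, 19, 360]          # Table 14
n = sum(observed)
expected = f2_expected(0.9507, n)
chi2 = chi_square(observed, expected)  # two degrees of freedom

print([round(e, 2) for e in expected], round(chi2, 2))
```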

If we take the second estimate of P, 0.9993, and
calculate the expectation from it the frequencies are
very different. In this case it is not possible to calculate
a $\chi^2$ for the goodness of fit as the expectations
are very much less than 5. It will be remembered
that $\chi^2$ cannot be used in such cases, as it fails to
follow the tabulated distribution at all closely when
this minimum expectation is not reached. It is,
however, quite clear that in the present example the
fit of the observed values to the second set of expectations,
those derived by inefficient estimation, is very
poor. It can be stated as a general rule that the $\chi^2$
test of goodness of fit cannot be used when some
statistic, necessary for the calculation of the expectations,
has been arrived at by inefficient estimation
(Fisher, 1928).

So far the only efficient method of estimation considered
has been that of maximum likelihood. For
any given problem of estimation other efficient
methods may, and often do, exist. Linkage values
are estimated efficiently by the use of the product
formula (Fisher and Balmakund, 1928; Immer,
1930). This method equates the fraction $\frac{a_1 a_4}{a_2 a_3}$ to its
expected value $\frac{m_1 m_4}{m_2 m_3}$ to give the equations of estimation.
In the case of the F2 the equation is

$$\frac{a_1 a_4}{a_2 a_3} = \frac{2P + P^2}{1 - 2P + P^2}$$

Any value of the left side of the equation corresponds
to but one value of P, and consequently tables for
the solution of these equations are easily made.
Such tables have been prepared by Immer (1930) and
it is only necessary to find the value of the fraction
$\frac{a_1 a_4}{a_2 a_3}$ and then look up the corresponding value of P
or p in the table. This method has a number of
advantages in the estimation of recombination fractions,
particularly that by its use certain difficulties
encountered in handling data showing poor viability
of certain genotypes are minimized. It will be considered
in more detail in this special connexion in
a later chapter.
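In place of a table, the product formula for the F2 may be solved directly. Writing z for the observed fraction, the equation becomes the quadratic $(1 - z)P^2 + 2(1 + z)P - z = 0$, whose admissible root is $P = \frac{(1 + z) - \sqrt{1 + 3z}}{z - 1}$. This closed form is an algebraic rearrangement made for the sketch below and is not taken from Immer's tables; the function name is likewise an assumption, and the middle classes must both be non-zero.

```python
import math


def product_formula_P(a1, a2, a3, a4):
    """Solve a1*a4 / (a2*a3) = (2P + P^2) / (1 - P)^2 for P.

    Classes are ordered AB, Ab, aB, ab. Requires a2 and a3 to be non-zero.
    """
    z = (a1 * a4) / (a2 * a3)
    if z == 1:
        return 0.25          # the quadratic degenerates to 4P - 1 = 0
    return ((1 + z) - math.sqrt(1 + 3 * z)) / (z - 1)


# Imai's Pharbitis totals (Table 13).
P = product_formula_P(187, 35, 37, 31)
print(round(P, 3))
```

For Imai's data the value so obtained lies close to the maximum likelihood estimate of 0.4835, as would be expected of an efficient method.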

The method of maximum likelihood is, however, 
the only method which leads to efficient estimates for 
all types of problems of estimation. All other 
methods need testing against this method before it 
is decided that they are efficient and consequently 
to be used. 



CHAPTER VI 


INFORMATION AND THE PLANNING OF 
EXPERIMENTS (II) 

16. THE AMOUNT OF INFORMATION AND ITS USES

THE previous chapter has introduced the concept
of a finite amount of information concerning 
a linkage value in a given body of data. Any given 
segregation will allow of the calculation of a linkage 
value whose maximum precision is determined by the 
expected values of the classes and the total number 
of individuals in the family. The whole of this 
information relevant to the recombination fraction p 
is extracted by the use of the method of maximum 
likelihood, but certain other statistics utilize only 
part of it, and consequently are less efficient estimates 
of the parameter in question. This concept of the 
amount of information present in any body of data 
is of great value in the planning of linkage experiments 
and a method has been developed to allow of its exact 
treatment. 

The greater the amount of information concerning 
the recombination fraction, the greater the precision, 
or the less the variance, of the estimate, and so it is 
convenient to define the total amount of information 
in the data as the inverse of the variance of that 
statistic obtained by the use of the method of maxi- 
mum likelihood. 

$$I_p = \frac{1}{V_p}$$

A further convenient distinction may be drawn
between the amount of information yielded by a
whole family of n individuals and the average amount
yielded by a single individual of the family. Then
$I_p = n i_p$.

The variance, of the estimate of p obtained by the 
method of maximum likelihood is always a minimum 
and is calculated from the formula 



This gives an easy method of calculating 
of ip 


3 [ 


m 


d 2 log m\ 
# 2 ) 


the value 


which is at a maximum for the body of data in ques- 
tion. An alternative formula that is sometimes 
easier and more convenient to use is 



(N.B. — This is identical with the previous formula.) 

The calculation of i p allows of the comparison of 
the precision of estimates of a parameter from two 
entirely different types of data. Note that for this 
purpose we use i p rather than I p since the former is 
independent of the number of individuals in the 
family. It is a measure of the value of single 
individuals in segregations of the types under 
consideration. 
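
The identity of the two formulae for $i_p$ can be checked numerically. The following is a minimal sketch only, assuming the coupling backcross AaBb x aabb with class expectations (1 - p)/2 and p/2; the function names and the representation of a class as a triple of callables (m, dm/dp, d²m/dp²) are illustrative choices, not part of the original treatment.

```python
def info_first_derivative(classes, p):
    """i_p = S[(1/m)(dm/dp)^2] over all classes."""
    return sum(dm(p) ** 2 / m(p) for m, dm, _ in classes)


def info_second_derivative(classes, p):
    """i_p = -S(m d^2 log m/dp^2), using d^2 log m/dp^2 = m''/m - (m'/m)^2."""
    total = 0.0
    for m, dm, d2m in classes:
        d2_log_m = d2m(p) / m(p) - (dm(p) / m(p)) ** 2
        total += m(p) * d2_log_m
    return -total


# Assumed example: coupling backcross, two parental and two recombinant classes.
parental = (lambda p: (1 - p) / 2, lambda p: -0.5, lambda p: 0.0)
recombinant = (lambda p: p / 2, lambda p: 0.5, lambda p: 0.0)
backcross = [parental, recombinant, recombinant, parental]

p = 0.2
i_a = info_first_derivative(backcross, p)
i_b = info_second_derivative(backcross, p)
print(i_a, i_b, 1 / (p * (1 - p)))
```

Both routines should return the same figure, which for the backcross is 1/[p(1 - p)].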

Ex. 10. As an example of the calculation and use
of quantities of information let us consider the precision
of the linkage values obtained from backcross
(AaBb x aabb) and $F_2$ (AaBb x AaBb) data
(Mather 1936a). The segregations expected in terms
of p in families of these types have been worked out
in the previous chapter. For this purpose we assume,
in the case of the $F_2$, that $p_1 = p_2$ and so obtain an
estimate of p from the value of P which is obtainable
from the data. The method of calculating the amount
of information per plant from the formula

$$i_p = S\left[\frac{1}{m}\left(\frac{dm}{dp}\right)^2\right]$$

in the backcross is shown in Table 16.

TABLE 16 (Mather 1936a)

                 Coupling                               Repulsion
Class    m            dm/dp    i_p             m            dm/dp    i_p
AB/ab    (1 - p)/2    -1/2     1/[2(1 - p)]    p/2          +1/2     1/(2p)
Ab/ab    p/2          +1/2     1/(2p)          (1 - p)/2    -1/2     1/[2(1 - p)]
aB/ab    p/2          +1/2     1/(2p)          (1 - p)/2    -1/2     1/[2(1 - p)]
ab/ab    (1 - p)/2    -1/2     1/[2(1 - p)]    p/2          +1/2     1/(2p)
Total    1            0        1/[p(1 - p)]    1            0        1/[p(1 - p)]


The first column gives the class genotype and it 
should be noted that each genotype is, in the usual 
case, distinguishable phenotypically. The second 
column gives the expectation of each class in terms 
of p and the third gives the first differential of the 
expectation with respect to p. From the values in 
the second and third columns it is easy to calculate 
the amount of information as shown in the fourth 
column. These class amounts are summed and give 
the value of the average amount of information 
per individual in a backcross family. The coupling 
and repulsion phases are considered separately, but 
it will be seen that, as might be expected, they 
eventually give the same value for $i_p$.

The value of $i_p$ from $F_2$ data depends on how completely
the family is classified. We may conveniently
recognize three types of classification. First of all,
classification may be complete. This will, even
under the most favourable circumstances, involve the
separation of the doubly heterozygous class (AaBb)
into those in the coupling and repulsion phases by
the use of progeny tests. If this is done the classes
are ten in number, with the expectations and amounts
of information concerning p as shown in Table 17.
This table is laid out precisely as the previous one.
Note that again coupling and repulsion yield the
same values for $i_p$. It will be seen that a completely
classified $F_2$ gives twice as much information about
p as does a backcross, which is not surprising when
it is remembered that an $F_2$ gives information about
recombination in both male and female gameto- 
genesis. 
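
The doubling of information in the completely classified F2 can be confirmed by direct summation over the ten genotypes. This is a sketch under stated assumptions: the coupling phase is taken, the gamete frequencies are (1 - p)/2 for AB and ab and p/2 for Ab and aB in both sexes, and the helper names are hypothetical.

```python
def f2_complete_classes(p):
    """Return {genotype: (m, dm/dp)} for the ten genotypes of a coupling F2."""
    par, rec = (1 - p) / 2, p / 2            # parental and recombinant gametes
    gametes = {"AB": (par, -0.5), "ab": (par, -0.5),
               "Ab": (rec, 0.5), "aB": (rec, 0.5)}
    names = list(gametes)
    classes = {}
    for i, g in enumerate(names):
        for h in names[i:]:
            (x, dx), (y, dy) = gametes[g], gametes[h]
            k = 1 if g == h else 2           # heterozygotes arise in two ways
            classes[g + "/" + h] = (k * x * y, k * (dx * y + x * dy))
    return classes


def info_per_individual(classes):
    """i_p = S[(1/m)(dm/dp)^2]."""
    return sum(dm * dm / m for m, dm in classes.values())


for p in (0.1, 0.3):
    i_f2 = info_per_individual(f2_complete_classes(p))
    print(p, i_f2, 2 / (p * (1 - p)))
```

The two printed quantities should agree for each value of p, the second being twice the backcross value 1/[p(1 - p)].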


TABLE 17 (Mather 1936a)

Coupling

Class    m              dm/dp          i_p
AB/AB    (1 - p)^2/4    -(1 - p)/2     1
AB/Ab    p(1 - p)/2     (1 - 2p)/2     (1 - 2p)^2/[2p(1 - p)]
AB/aB    p(1 - p)/2     (1 - 2p)/2     (1 - 2p)^2/[2p(1 - p)]
AB/ab    (1 - p)^2/2    -(1 - p)       2
Ab/aB    p^2/2          p              2
Ab/Ab    p^2/4          p/2            1
Ab/ab    p(1 - p)/2     (1 - 2p)/2     (1 - 2p)^2/[2p(1 - p)]
aB/aB    p^2/4          p/2            1
aB/ab    p(1 - p)/2     (1 - 2p)/2     (1 - 2p)^2/[2p(1 - p)]
ab/ab    (1 - p)^2/4    -(1 - p)/2     1
Total    1              0              2/[p(1 - p)]

Repulsion

Class    m              dm/dp          i_p
AB/AB    p^2/4          p/2            1
AB/Ab    p(1 - p)/2     (1 - 2p)/2     (1 - 2p)^2/[2p(1 - p)]
AB/aB    p(1 - p)/2     (1 - 2p)/2     (1 - 2p)^2/[2p(1 - p)]
AB/ab    p^2/2          p              2
Ab/aB    (1 - p)^2/2    -(1 - p)       2
Ab/Ab    (1 - p)^2/4    -(1 - p)/2     1
Ab/ab    p(1 - p)/2     (1 - 2p)/2     (1 - 2p)^2/[2p(1 - p)]
aB/aB    (1 - p)^2/4    -(1 - p)/2     1
aB/ab    p(1 - p)/2     (1 - 2p)/2     (1 - 2p)^2/[2p(1 - p)]
ab/ab    p^2/4          p/2            1
Total    1              0              2/[p(1 - p)]

The second type of classification that it is convenient
to consider is that which arises when factors
A,a and B,b both show incomplete dominance, i.e.
when the three possible combinations of the two
allelomorphs at one locus are recognizable phenotypically.
It is assumed that classification for one
factor is independent of the allelomorphs present at
the other locus, i.e. that the two factors are independent
in expression. Such classification will give
nine phenotypic classes of which one, AaBb, will
comprise two distinct genotypes, the coupling and
repulsion heterozygotes. This class will be composite
and have the expectation (1 - 2p + 2p^2)/2, which
contributes

$$\frac{2(1-2p)^2}{1-2p+2p^2}$$

to the value of $i_p$. This contribution replaces those
of 2 and 2 which are made by the separated coupling
and repulsion double heterozygotes. Hence the value
of $i_p$ obtained from an $F_2$ classified in this manner is:

$$i_p = \frac{2}{p(1-p)} - 4 + \frac{2(1-2p)^2}{1-2p+2p^2}$$

or

$$i_p = \frac{2(1-3p+3p^2)}{p(1-p)(1-2p+2p^2)}$$

Finally we may consider the commonest case of 
all, that of complete dominance of A over its allelo- 
morph a and of B over b. There are then four 
phenotypic classes in the $F_2$ of which three contain
more than one genotypic class. The values of the
contributions of these classes to $i_p$ are worked out
in Table 18. 

TABLE 18 (Mather 1936a)

Coupling

Class      m                   dm/dp         i_p
AB         (3 - 2p + p^2)/4    -(1 - p)/2    (1 - p)^2/(3 - 2p + p^2)
Ab + aB    (2p - p^2)/2        1 - p         2(1 - p)^2/(2p - p^2)
ab         (1 - p)^2/4         -(1 - p)/2    1
Total      1                   0             2(3 - 4p + 2p^2)/[p(2 - p)(3 - 2p + p^2)]

Repulsion

Class      m                   dm/dp         i_p
AB         (2 + p^2)/4         p/2           p^2/(2 + p^2)
Ab + aB    (1 - p^2)/2         -p            2p^2/(1 - p^2)
ab         p^2/4               p/2           1
Total      1                   0             2(1 + 2p^2)/[(2 + p^2)(1 - p^2)]

It will be seen that coupling and repulsion $F_2$ do
not yield equal amounts of information concerning
p when classification is of this type.

We can now compare the efficiencies of the backcross
and $F_2$ of varying degrees of classification, for
the calculation of linkage values. For this purpose
it is best to take the amount of information given by
the backcross as standard because in this way infinite
values of $i_p$ are avoided. With this standard the
relative values of the different types of data are as
shown in Table 19.


TABLE 19

Type of family              Coupling                                          Repulsion
Backcross                   1                                                 1
F2 completely classified    2                                                 2
F2 incomplete dominance     2(1 - 3p + 3p^2)/(1 - 2p + 2p^2)                  2(1 - 3p + 3p^2)/(1 - 2p + 2p^2)
F2 complete dominance       2(1 - p)(3 - 4p + 2p^2)/[(2 - p)(3 - 2p + p^2)]   2p(1 + 2p^2)/[(2 + p^2)(1 + p)]


These relative values are dependent on p itself.
Consequently a clear idea of the meaning of these
values will be obtained by plotting the value against
p in the form of a graph. For this purpose repulsion
is considered to be an extension of coupling. Thus
p = 0.3 in repulsion may be plotted as p = 0.7
in coupling. Fig. 1 is then obtained.

[Fig. 1 (after Mather 1936a). Horizontal axis: COUPLING | REPULSION.]

This Fig. 1 is instructive in a number of ways.
In the first place it is easy to see that in the case of
two incompletely dominant factors the $F_2$ is as good
as the backcross for the detection of linkage, i.e.
the detection of deviation from p = 0.5. For the
measurement of linkage values, particularly when
the recombination value is small in either phase,
this $F_2$ is better than the backcross in that it gives
more information about p and so a more precise
estimate of the recombination fraction. The case
of factors having one allelomorph completely dominant
over the other is, however, very different. In
close coupling the $F_2$ is almost as good as the backcross,
but in close repulsion this is far from the case.
The backcross is then vastly better for the estimation
of p. Where there is no linkage, a case important
in that this is the hypothesis tested in order to detect
linkage, the $F_2$ has 4/9 of the value of the backcross.
It is thus not so good for the detection of loose linkages.
In general an $F_2$ of this the most usual type
is much less efficient than the backcross except for 
the single case of close coupling. So it is not to be 
recommended, when the alternatives are of equal 
practical ease. 
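
The relative values for the F2 with complete dominance can be recomputed from the class expectations, which gives a check on the figure of 4/9 quoted for p = 0.5. The sketch below assumes the three scorable classes AB, Ab + aB and ab with expectations (2 + P)/4, (1 - P)/2 and P/4, where P = (1 - p)^2 in coupling and P = p^2 in repulsion; the function names are illustrative only.

```python
def f2_dominant_info(p, phase="coupling"):
    """i_p for an F2 with complete dominance at both loci."""
    if phase == "coupling":
        P, dP = (1 - p) ** 2, -2 * (1 - p)
    else:
        P, dP = p ** 2, 2 * p
    # (m, dm/dp) for AB, Ab + aB, ab
    classes = [((2 + P) / 4, dP / 4), ((1 - P) / 2, -dP / 2), (P / 4, dP / 4)]
    return sum(dm * dm / m for m, dm in classes)


def relative_to_backcross(i_p, p):
    """Divide by the backcross value 1/[p(1 - p)]."""
    return i_p * p * (1 - p)


def closed_form_coupling(p):
    return 2 * (1 - p) * (3 - 4 * p + 2 * p * p) / ((2 - p) * (3 - 2 * p + p * p))


def closed_form_repulsion(p):
    return 2 * p * (1 + 2 * p * p) / ((2 + p * p) * (1 + p))


for p in (0.01, 0.1, 0.5):
    c = relative_to_backcross(f2_dominant_info(p, "coupling"), p)
    r = relative_to_backcross(f2_dominant_info(p, "repulsion"), p)
    print(p, round(c, 4), round(r, 4))
```

At small p the coupling value approaches 1 while the repulsion value approaches 0, which is the contrast described in the text.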

In this way, given the behaviour of the two single 
factors, it is possible to decide on (a) the best type
of family for the detection of linkage and (b) the
best type of family for its estimation. Practical 
considerations also enter into the question, e.g. the 
ease of backcrossing as compared with inbreeding is 
an important consideration in plant genetics. These 
considerations may to some extent set off the statis- 
tical advantages of a given type of data, but the 
experimenter, knowing his crop, will be able to form 
a fairly accurate estimate of the relative importance 
of the various considerations and will be able to 
reach a confident conclusion as to the best method of 
tackling the particular problem at hand.

Fig. 1 also illustrates another very important
point, that of the loss of information resulting from
incomplete classification. Any $F_2$ contains twice as
much information about the recombination fraction
as a backcross, but the limitations of classification
result in a certain loss, which, in the case of completely
dominant genes in close repulsion, may
amount to an extremely large proportion of the
whole. It is very clear that data should be as completely
classified as is immediately possible in order
to reduce this loss to a minimum. Where the further
classification involves progeny tests, as it would in
$F_2$ families, the number of plants and the labour may
or may not be such as to render the extra classification
unprofitable as compared with growing further
families of a similar kind. The policy to be adopted
with respect to growing $F_3$s to test the $F_2$ individual
genotypes or growing further $F_2$ families is capable
of exact treatment by the calculation of quantities
of information. This has been done in some detail
by Immer (1934) and Mather (1936) and need not
be treated here. It is sufficient to say that with
a repulsion $F_2$ it is profitable to test the genotypes
of the singly dominant classes (Ab and aB) when p
is less than 0.08, and to test the genotypes of the
doubly dominant class (AB) when p is less than 0.22.
Thus the use of this method allows of specification
of the classes which can be tested with profit. It
allows of very precise planning of such experiments.

Ex. 11. The previous example was a consideration 
of a relatively familiar problem, but one of the great 
advantages of the method of approach developed 
above is that it helps to clarify policy when unusual 
genetical situations are encountered. Reference to 
various papers will usually provide all the information 
and experience necessary for reaching a decision in 
connexion with the more ordinary genetical situa- 
tions, but this is not true of some other less usual 
circumstances. The experimenter is then forced 
to deal with the situation unaided by previous 
experience. 

As an example let us consider the case of linkage 
between one gene and another which is a member of 
a pair of complementary factors (Hutchinson 1929). 
If these genes are respectively A, a and B,b, then the 
expression of the B,b difference is dependent on the 
presence of one or other of the two allelomorphs of 
a third factor, C,c. More precisely the genotypes 
Bcc, bbC and bbcc are phenotypically alike. In 
order to measure the linkage between A, a and B,b 
it is, of course, necessary to raise the double heterozygote
AaBb. It will usually be the case that the
individual is also heterozygous for C,c as it is not 
easy to tell which of the two complementary factors 
is involved in the linkage. When the heterozygote 
is of this type, AaBbCc, several possible crosses are 
open to the experimenter. He may cross to a stock
recessive for a and for one of the complementary
factors while being homozygous dominant for the
other complementary.

There will be two such crosses possible, AaBbCc x 
aabbCC, and AaBbCc x aaBBcc. The former 
will yield extremely good data from which the linkage 
can be estimated. The latter will show no immediate 
segregation of B and b and so will be useless or 
nearly so, from the point of view of linkage estimation.
If it is not known which of the two complementaries
is linked to A,a the probability of any
cross of this type being that which will give useful
linkage data is 1/2.

Another line of policy is to cross the triple hetero- 
zygote to a triple recessive, AaBbCc x aabbcc. 
The classification for B and b will be incomplete 
owing to the segregation of C and c. The classi- 
fication for A and a is, however, complete. 

The final possibility is that of selfing or inbreeding 
the triple heterozygotes. Classification for A and a 
will be incomplete, but there will be less disturbance 
of the B,b classification as only 1/4 of the progeny
will have cc, which renders the B,b difference 
undetectable. 

There are other possibilities, of course, but these 
seem to be the most likely to arise in practice. The 
problem then resolves itself into one of choosing 
which of three types of cross to use when each type 
has its peculiar disadvantages. Assume that p is 
the same in male and female gametogenesis and then 
it can be shown that the expectations of the four 
scorable phenotypic classes relevant to the linkage 
in the three types of cross are: 


TABLE 20

Cross                      AB(C)          Ab(C) + A(c)    aB(C)          ab(C) + a(c)
(1) AaBbCc x aabbCC        (1 - p)/2      p/2             p/2            (1 - p)/2
    AaBbCc x aaBBcc        1/4            1/4             1/4            1/4
(2) AaBbCc x aabbcc        (1 - p)/4      (1 + p)/4       p/4            (2 - p)/4
(3) AaBbCc x AaBbCc        (6 + 3P)/16    (6 - 3P)/16     (3 - 3P)/16    (1 + 3P)/16

where P = (1 - p)^2.

The expected segregations given above are those for
the coupling phase. Similar expressions for linkage
in the repulsion phase may be formulated by substituting
1 - p for p in these expressions.
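
The expectations for the four scorable classes can be set down as a small routine, which also makes the substitution of 1 - p for p in repulsion explicit. This is an illustrative sketch: the cross labels "1a", "1b", "2" and "3" and the function name are assumptions introduced here for convenience.

```python
def expectations(cross, p, phase="coupling"):
    """Expected proportions of AB(C), Ab(C)+A(c), aB(C), ab(C)+a(c)."""
    if phase == "repulsion":
        p = 1 - p                      # repulsion: substitute 1 - p for p
    P = (1 - p) ** 2
    table = {
        "1a": [(1 - p) / 2, p / 2, p / 2, (1 - p) / 2],      # AaBbCc x aabbCC
        "1b": [0.25, 0.25, 0.25, 0.25],                      # AaBbCc x aaBBcc
        "2": [(1 - p) / 4, (1 + p) / 4, p / 4, (2 - p) / 4],  # AaBbCc x aabbcc
        "3": [(6 + 3 * P) / 16, (6 - 3 * P) / 16,
              (3 - 3 * P) / 16, (1 + 3 * P) / 16],           # AaBbCc x AaBbCc
    }
    return table[cross]


for cross in ("1a", "1b", "2", "3"):
    row = expectations(cross, 0.2)
    print(cross, [round(m, 4) for m in row], round(sum(row), 6))
```

Each row should sum to unity whatever the value of p, which is a useful check on the transcription of the table.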

Having the expected frequencies for each class in
the various types of family we are now in a position
to work out the values of $i_p$ in terms of p. These are
obtained from the formula

$$i_p = S\left[\frac{1}{m}\left(\frac{dm}{dp}\right)^2\right]$$

as in the previous example. We then find the following
values:

(1) AaBbCc x aabbCC

$$i_p = \frac{1}{p(1-p)}$$

    AaBbCc x aaBBcc

$$i_p = 0$$

(2) AaBbCc x aabbcc

$$i_p = \frac{1+2p-2p^2}{2p(1-p)(1+p)(2-p)}$$

(3) AaBbCc x AaBbCc

$$i_p = \frac{3P(5+2P-4P^2)}{(2+P)(1-P)(2-P)(1+3P)}$$

where P = (1 - p)^2.

Taking the first type of family, the backcross giving
completely classifiable segregation of the linked
factors, as standard the following relative values of
the families are obtained.

$$(2)\quad \frac{1+2p-2p^2}{2(1+p)(2-p)}$$

$$(3)\quad \left(\frac{3P(5+2P-4P^2)}{(2+P)(2-P)(1+3P)}\right)\left(\frac{1-p}{2-p}\right)$$

The values of the different amounts of information
may be plotted against the value of p, taking repulsion
as an extension of coupling, as previously.

[Fig. 2. Horizontal axis: COUPLING | REPULSION.]

Fig. 2 is then obtained. This figure supplies all the
information necessary for our purpose. In the first
place it is seen that the first type of family is very
much more informative than any others. The relative
values of the third and fourth types of family,
the complete backcross and the $F_2$, change over the
range p = 0.0 in coupling to p = 0.0 in repulsion.
Sometimes the former is better than the latter, and
sometimes the reverse is the case. The important
point is, however, that the cross of type 1, AaBbCc x
aabbCC, is always at least 2.5 times as informative
as the better of the other types and has usually an
even greater value than this. Since it is usually
impossible to draw a distinction between crosses of
types 1 and 2, owing to the lack of knowledge as to
which complementary factor is linked with A,a, we
must discount the value of cross 1 by half its value.
It is clear, however, that, no matter what linkage
value is concerned or whether the problem is one of
detection rather than estimation, the most profitable
policy is to grow equal numbers of progenies from
crosses of types 1 and 2, AaBbCc x aabbCC and
AaBbCc x aaBBcc. One cross will give no immediate
information about linkage, but since the other
supplies more than twice the information of the
better alternative, the total information obtained by 
this procedure will still be greater than that which 
can be obtained in any other way. This result is 
certainly not one that could have been easily fore- 
seen. It is in many ways novel to adopt a policy 
that involves the considered wastage of half the 
individuals produced. In general such a policy 
would probably not be the best, but in this particular 
case the limitations imposed on classification by the 
factor interaction are such that one type of cross, 
itself distinguishable from a useless alternative, has 
such a preponderant value that discarding half the 
progenies is justified. The value of planning the 
linkage experiments is here demonstrated in a most 
striking way. 
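
The comparison on which this policy rests can be reproduced by scanning p over the whole range, treating repulsion as an extension of coupling. The following is a sketch only, assuming the class expectations tabulated for the three types of cross; the names cross1, cross2 and cross3 are hypothetical labels for AaBbCc x aabbCC, AaBbCc x aabbcc and AaBbCc x AaBbCc respectively.

```python
def info(classes):
    """i_p = S[(1/m)(dm/dp)^2] for a list of (m, dm/dp) pairs."""
    return sum(dm * dm / m for m, dm in classes)


def cross1(p):   # AaBbCc x aabbCC
    return info([((1 - p) / 2, -0.5), (p / 2, 0.5),
                 (p / 2, 0.5), ((1 - p) / 2, -0.5)])


def cross2(p):   # AaBbCc x aabbcc
    return info([((1 - p) / 4, -0.25), ((1 + p) / 4, 0.25),
                 (p / 4, 0.25), ((2 - p) / 4, -0.25)])


def cross3(p):   # AaBbCc x AaBbCc, with P = (1 - p)^2
    P, dP = (1 - p) ** 2, -2 * (1 - p)
    return info([((6 + 3 * P) / 16, 3 * dP / 16), ((6 - 3 * P) / 16, -3 * dP / 16),
                 ((3 - 3 * P) / 16, -3 * dP / 16), ((1 + 3 * P) / 16, 3 * dP / 16)])


# p on the coupling scale; values above 0.5 correspond to repulsion.
grid = [k / 100 for k in range(1, 100)]
worst = min(cross1(p) / max(cross2(p), cross3(p)) for p in grid)
print(round(worst, 3), round(worst / 2, 3))
```

The smallest ratio found on this grid is close to 2.5, so that even after discounting cross 1 by half, the mixed policy remains more informative than either alternative.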



CHAPTER VII 


COMBINED ESTIMATION AND TESTING 
HETEROGENEITY 

17. COMBINED ESTIMATION 

ONLY simple problems of estimation have been
considered up to the present. These have consisted
of estimation from single families of given
types, in which connexion it has been shown that the
method of maximum likelihood has a number of
advantages. This method of estimation is also of
value in the solution of two somewhat different
problems, viz. those of arriving at the best estimate
of a parameter when data of several different kinds
are available, and of testing the homogeneity of such
aggregates of data (Mather 1935).

Let us first consider the question of combined esti- 
mation. This is well illustrated by the estimation of 
the simplex index of separation in autotetraploid 
segregations. In organisms of this type there is no 
simple expectation for single factor segregations and, 
in order to describe the segregation of a factor, it is 
necessary to calculate the value of the parameter 
named above (see Mather, 1936b). The expected
gametic segregation of a simplex autotetraploid, i.e.
one whose constitution is Aaaa, is

A (AA or Aa) (1/8)(4 - α) : aa (1/8)(4 + α)

where α is the simplex index of separation. On
selfing, one expects a segregation of

A (1/64)(48 - 8α - α^2) : a (1/64)(16 + 8α + α^2)

Ex. 12. Sansome (quoted by Mather 1936b) has
observed the following segregations for the factor
R,r in simplex autotetraploids of the tomato.

TABLE 20

                            R      r      Total
(Aaaa x aaaa) Backcross     48     67     115
(Aaaa x Aaaa) F2            605    221    826

The former is clearly the gametic segregation of
the simplex individual and the latter is the segregation
to be obtained from selfing the same plant.
Hence the expectations are as given above.

TABLE 21

             R                       r                       Total
Backcross    (4 - α)/8               (4 + α)/8               1
F2           (48 - 8α - α^2)/64      (16 + 8α + α^2)/64      1

What is the best estimate of α that can be obtained
from these data?

The likelihoods of obtaining such segregations
separately are, following the argument given in
Chapter V,

Backcross

$$C_1\left[\tfrac{1}{8}(4-\alpha)\right]^{48}\left[\tfrac{1}{8}(4+\alpha)\right]^{65}$$

F2

$$C_2\left[\tfrac{1}{64}(48-8\alpha-\alpha^2)\right]^{605}\left[\tfrac{1}{64}(16+8\alpha+\alpha^2)\right]^{221}$$

where $C_1$ and $C_2$ are constants not involving α.
The likelihood of obtaining these two segregations
jointly is the product of the two individual likelihoods.
Then the logarithm of the joint likelihood
will be given by the sum of the individual logarithm
likelihood expressions, i.e. apart from a constant will be:

$$L = 48\log(4-\alpha) + 65\log(4+\alpha) + 605\log(48-8\alpha-\alpha^2) + 221\log(16+8\alpha+\alpha^2)$$

The maximum likelihood estimate of α will be
obtained by maximizing this summed logarithm likelihood
expression with respect to α. Differentiating
and equating to zero we obtain:

$$\frac{dL}{d\alpha} = -\frac{48}{4-\alpha} + \frac{65}{4+\alpha} - \frac{605(8+2\alpha)}{48-8\alpha-\alpha^2} + \frac{221(8+2\alpha)}{16+8\alpha+\alpha^2} = 0$$

The solution of this equation is not difficult to arrive
at by algebraic methods, but it will serve as an
example of the alternative process of solution by
arithmetic trial and interpolation. This latter
method is of great value where there exists no easy
algebraic method of approach.

As a first approximation to the solution of the
equation put α = 0.20. The value of the left side
of the equation may then be calculated as shown in
Table 22. It will be seen that this value is negative
and so we have chosen too high a value for α. Then
repeat the calculation using α = 0.10. This value is
clearly too low for α. We then make a linear interpolation
between these values of α, and find that the
second approximation to the true value is α = 0.17.
Trial of this value shows it to be somewhat too small
and so 0.18 is next tried. This is also found to be a
trifle too small, so showing that our linear interpolation
was not strictly correct for these data. However,
on trying 0.181 we obtain a negative value for
the left side of the equation and so can again interpolate
between 0.18 and 0.181. This gives us 0.1803
as a third approximation to the true value of α.
The value arrived at by algebraic solution of the
equations agrees with this value to four decimal
places.

TABLE 22

α                                0.20         0.10         0.17         0.18         0.181
-48/(4 - α)                      -12.6316     -12.3077     -12.5326     -12.5654     -12.5687
65/(4 + α)                       14.7727      15.8537      15.5875      15.5502      15.5465
-605(8 + 2α)/(48 - 8α - α^2)     -109.6204    -105.1282    -108.2510    -108.7054    -108.7509
221(8 + 2α)/(16 + 8α + α^2)      105.2381     107.8049     105.9952     105.7416     105.7163
Total                            -2.9429      6.2227       0.7991       0.0210       -0.0568

It may be noted that although the first interpolation
was somewhat inaccurate owing to the
wrong assumption of a straight line relation between
the values of α and the expression, the second interpolation
was more accurate. In general, the closer
the limits between which interpolation is made, the
more accurate the result. Further interpolation
could be tried to obtain the value of α even more
accurately; but not more than four decimal places
are warranted by the data in this case and so further 
calculation would be wasted. 
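
The process of trial and linear interpolation lends itself to mechanical repetition. The sketch below is offered under stated assumptions: the class counts 48, 65, 605 and 221 are those appearing in the logarithm likelihood expression and in Table 22 (the table of observations prints 67 for the second count, so the figure 65 should be treated as an assumption), and the routine names are hypothetical.

```python
def score(alpha, bc=(48, 65), f2=(605, 221)):
    """Left side of the likelihood equation, dL/d(alpha)."""
    a1, a2 = bc
    b1, b2 = f2
    return (-a1 / (4 - alpha) + a2 / (4 + alpha)
            - b1 * (8 + 2 * alpha) / (48 - 8 * alpha - alpha ** 2)
            + b2 * (8 + 2 * alpha) / (16 + 8 * alpha + alpha ** 2))


def solve_by_interpolation(lo, hi, tol=1e-10, max_steps=50):
    """Repeated linear interpolation between a value that is too low
    (positive score) and one that is too high (negative score)."""
    f_lo, f_hi = score(lo), score(hi)
    x = lo
    for _ in range(max_steps):
        x = lo + (hi - lo) * f_lo / (f_lo - f_hi)
        f_x = score(x)
        if abs(f_x) < tol:
            break
        if f_x > 0:
            lo, f_lo = x, f_x      # still too low
        else:
            hi, f_hi = x, f_x      # too high
    return x


alpha_hat = solve_by_interpolation(0.10, 0.20)
print(round(alpha_hat, 4), round(score(0.18), 4), round(score(0.181), 4))
```

Started from the limits 0.10 and 0.20, the routine should settle on a value agreeing with 0.1803 to four decimal places.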

The method of arithmetic approximation has 
another great advantage in that it automatically 
leads to an estimate of the variance of α. It will be
remembered that $I_\alpha$, the inverse of $V_\alpha$, can be obtained
from the second differential of the logarithm likelihood
expression with respect to α. In other words, $I_\alpha$ is
the rate of change on α of the maximum likelihood
expression, which is itself the first differential with
respect to α. Now when

$$\alpha = 0.180, \qquad \frac{dL}{d\alpha} = 0.0210$$

and when

$$\alpha = 0.181, \qquad \frac{dL}{d\alpha} = -0.0568$$

Hence a change of 0.001 in α results in a change of
0.0778 in the value of $dL/d\alpha$. Hence the rate of change
of $dL/d\alpha$ on α in this region is 0.0778/0.001 or, in other
words, $I_\alpha = 77.8$.

If we calculated $I_\alpha$ from the difference of the maximum
likelihood values when α = 0.10 and 0.20, the
result would be somewhat smaller than that obtained
above because the rate of change decreases as the
approximation to the value of α becomes coarser.
Consequently, the closest approximations to the true
value should be used in applying this empirical
method of obtaining quantities of information.

Having obtained $I_\alpha = 77.8$ we find that
$V_\alpha = 0.01284$ and that $s_\alpha = \sqrt{V_\alpha} = 0.113$.
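
The empirical amount of information may be obtained in the same arithmetic way. This is a minimal sketch, assuming the likelihood equation with the counts 48, 65, 605 and 221 used in Table 22; the function name is illustrative.

```python
import math


def score(alpha):
    """dL/d(alpha) for the combined backcross and F2 data (counts assumed)."""
    return (-48 / (4 - alpha) + 65 / (4 + alpha)
            - 605 * (8 + 2 * alpha) / (48 - 8 * alpha - alpha ** 2)
            + 221 * (8 + 2 * alpha) / (16 + 8 * alpha + alpha ** 2))


step = 0.001
# Rate of fall of the likelihood expression between the two closest trial values.
information = (score(0.180) - score(0.180 + step)) / step
variance = 1 / information
standard_error = math.sqrt(variance)
print(round(information, 1), round(standard_error, 3))
```

The information so found should lie close to 77.8 and the standard error close to 0.113.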

It will be noted that the value of $I_\alpha$ obtained in this
way is not precisely the same as that obtained from
the use of the formulae

$$I_\alpha = -S\left(nm\,\frac{d^2\log m}{d\alpha^2}\right) = S\left[\frac{n}{m}\left(\frac{dm}{d\alpha}\right)^2\right]$$

The value obtained arithmetically is the actual
amount of information present in this particular body
of data. The value yielded by the formulae quoted
is the mean amount of information to be expected
from a large number of families of this kind and size.
In the present case the mean value of $I_\alpha$ is 78.1 or
slightly more than the value obtained arithmetically.
Either value may be used for the purpose of esti- 
mating the variance of the parameter as they never 
differ by much ; but the expected mean amount of 
information should always be employed in planning 
experiments as shown in the last chapter. 
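
The expected mean amount of information can be computed from the class expectations of the two families. The sketch assumes family sizes of 113 (that is 48 + 65, the counts used in the likelihood equation) and 826, and evaluates the expression at the estimate 0.1803; these figures and the function name are assumptions for illustration.

```python
def expected_information(alpha, n_backcross=113, n_f2=826):
    """Sum over families of n * S[(1/m)(dm/d alpha)^2]."""
    backcross = [((4 - alpha) / 8, -1 / 8),
                 ((4 + alpha) / 8, 1 / 8)]
    f2 = [((48 - 8 * alpha - alpha ** 2) / 64, -(8 + 2 * alpha) / 64),
          ((16 + 8 * alpha + alpha ** 2) / 64, (8 + 2 * alpha) / 64)]

    def per_individual(classes):
        return sum(dm * dm / m for m, dm in classes)

    return n_backcross * per_individual(backcross) + n_f2 * per_individual(f2)


mean_information = expected_information(0.1803)
print(round(mean_information, 1))
```

With these assumptions the result should agree with the mean value of 78.1 quoted in the text.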

18. TESTING HETEROGENEITY 

The other type of problem to which the method of 
maximum likelihood is adapted is that of the detec- 
tion of heterogeneity between different bodies of data 
concerning the same parameter. Its use in this 
respect may also be illustrated by example (Mather 
1935). 

Ex. 13. Bateson (1909) records segregations for
the two genes purple-red flower colour and long-round
pollen in the Sweet Pea. A family showed the
following segregation:

Purple Long    Purple Round    Red Long    Red Round
296            19              27          85

indicating linkage in the coupling phase. An $F_3$ also
gave evidence of linkage in coupling in the segregation

Purple Long    Purple Round    Red Long    Red Round
583            26              24          170

Do these two sets of data agree in showing the same
recombination fraction for the two genes?

This question may be approached in several ways.
First, we may calculate the value of P (= (1 - p)^2)
and its variance for each set of data separately and
test the significance of the difference of the linkage
values by using the formula

$$\frac{d}{s} = \frac{P_1 - P_2}{\sqrt{V_{P_1} + V_{P_2}}}$$

The numerator of this fraction is the difference between
the two values of P and the denominator is the estimated
standard error of this difference. Using the
methods of Chapter V we find

$$d = P_1 - P_2 = 0.088636$$

$$s = \sqrt{V_{P_1} + V_{P_2}} = 0.033786$$

and the difference divided by its standard error is
2.6235.

A table of probabilities of normal deviates (Table 1)
shows this value to have a probability of just less
than 1 per cent. This value also corresponds to a
χ² of 6.8828, which is obtained by squaring the value
2.6235. This method of analysis suffers from a
serious disadvantage. The two variances are calculated
on the bases of the two separate estimates
of P. We desire to test the hypothesis that these
data agree in showing one value of P, so the variances
should have been calculated on the basis of the best 
combined estimate of P. Hence the values of the 
variances reached above are not correct. In fact, 
one is too large and the other too small. The two 
discrepancies, though of opposite sign, are not of 
necessity equal in magnitude and will not balance. 
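
For reference, the arithmetic of this first and admittedly imperfect test can be reproduced directly from the quoted difference and standard error. The sketch uses the complementary error function of the standard library in place of a printed table of normal deviates; the two-sided probability is an assumption as to how the table is entered.

```python
import math

d = 0.088636          # difference between the two estimates of P
s = 0.033786          # estimated standard error of the difference

deviate = d / s
# Two-sided probability of a normal deviate at least this large.
probability = math.erfc(deviate / math.sqrt(2))
chi_square = deviate ** 2
print(round(deviate, 4), round(probability, 4), round(chi_square, 4))
```

The probability so found falls a little below 1 per cent, in agreement with the reading taken from the table.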

Thus we are led to approach the problem as one of 
combined estimation. Assuming homogeneity, we 
can add the two sets of data together and estimate P 
from the totals, obtaining 0.843047. This value may
then be used in the formulation of expected segregation
for the two families. From the relations of
these expectations to the observed segregations we
may calculate χ² for the $F_2$ and $F_3$ families separately
as is done in Table 23. In this table the two central
or singly dominant classes of the family are added 
together for purposes of estimation as they have the 
same expectation in terms of P. Any difference 
between them will be dependent on the single factor 
ratios alone, and so holds no interest for us in the 
discussion of this present problem. 


TABLE 23 (Mather 1935)

         Observed    Expected (nm)     n dm/dP     Discrepancy in         Amount of             χ²
         (a)         [P = 0.843047]                likelihood equation    information           [(a - nm)^2/nm]
                                                   [(a/m) dm/dP]          n[(1/m)(dm/dP)^2]

F2       296         303.495           106.75      104.114                37.55                 0.1851
         46          33.510            -213.50     -293.077               1360.26               4.6553
         85          89.995            106.75      100.825                126.62                0.2772
Total    427         427.000           0.00        -88.138                1524.43               5.1176

F3       583         570.742           200.75      205.062                70.61                 0.2633
         50          63.016            -401.50     -318.570               2558.12               2.6885
         170         169.242           200.75      201.649                238.12                0.0034
Total    803         803.000           0.00        +88.141                2866.85               2.9552

Total χ² for both families                                                                      8.0728


The contributions to χ² of the two sets of families
are 5.1176 and 2.9552, giving a total of 8.0728. This
will have three degrees of freedom because there
were two degrees of freedom in each family (three
classes), but one of the total of four has been lost in
estimating the linkage value. Consequently we are
led to the conclusion that the deviation of the two
families from their expectation is just significant,
the probability being just less than 5 per cent.
The test may be made more sensitive, however,
since at present it concerns not only the agreement
of the two families with respect to the linkage value,
but also two superfluous degrees of freedom concerning
single factor ratios in which we are not
particularly interested. These superfluous portions of
χ² may be removed as follows. The two distinct
values of P yielded by the families separately have
already been found. Expectations based on these
may be formulated and agreement between them and
the corresponding observations may be tested by
calculating χ² values. We find these two χ² values
to be 0.0228 and 0.2397 (see Table 24). If these
values are subtracted from the χ² calculated on the
basis of the joint estimate of P, the remainder is
concerned solely with difference between the families
and has one degree of freedom. Since the single
factor ratios of both genes in both families are good,
the discrepancy, if any, between the families must be
due to discrepancies in the values of P shown by the
two segregations. The remainder of χ² obtained in
this way is 7.8101 for one degree of freedom and is
highly significant. The two sets of data do not agree
in the recombination fractions that they show.

TABLE 24

                          Observed    Expected    Deviation    χ²
                          (a)         (nm)        (a - nm)     [(a - nm)^2/nm]

F2 [P = 0.78552]          296         297.354     -1.354       0.0062
                          46          45.792      +0.208       0.0010
                          85          83.854      +1.146       0.0156
Total                     427         427.000     0.000        0.0228

F3                        583         576.987     +6.013       0.0627
                          50          50.526      -0.526       0.0055
                          170         175.487     -5.487       0.1717
Total                     803         803.000     0.000        0.2399


The χ² obtained in this way is much more significant
than that obtained by the first method discussed.
This difference well exemplifies the advantages of
using an exact method of approach. Differences are
likely to be overlooked if an insensitive test is applied.

The exact method of calculating the heterogeneity
χ² is very long but may be shortened considerably
by using the formula

$$\chi^2 = Q^2 + \frac{1}{I_P}\left[\frac{a_1}{2+P} - \frac{a_2+a_3}{1-P} + \frac{a_4}{P}\right]^2$$

where P is the joint estimate of the function of the
recombination fraction and $a_1$ to $a_4$ are the observed
numbers in the four phenotypic classes. $Q^2$ is that
portion of χ² which is accounted for by the deviations
of the single factor ratios from their expectations
(Fisher 1936a). Consequently the portion

$$\frac{1}{I_P}\left[\frac{a_1}{2+P} - \frac{a_2+a_3}{1-P} + \frac{a_4}{P}\right]^2$$

is a χ² which is dependent on the discrepancy between
the calculated joint value of P and the value afforded
by the body of data in question. It is obtained by
squaring the deviation from zero of the maximum
likelihood expression and dividing by the amount
of information concerning P in the particular body of
data. Thus the χ² for heterogeneity of the linkage
values is given by the formula

$$\chi^2 = S\left\{\frac{1}{I_P}\left[\frac{a_1}{2+P} - \frac{a_2+a_3}{1-P} + \frac{a_4}{P}\right]^2\right\}$$

which may be written

$$\chi^2 = S\left[\frac{1}{I_P}\left(\frac{dL}{dP}\right)^2\right]$$

summation proceeding over all bodies of data.

The calculation of the heterogeneity in the example
under consideration is shown in Table 23. The first
column shows the observed segregations. In the
second column is the value expected. The third
column is found from the differential of the expected
frequencies. Thus for the first row of the table
m = (2 + P)/4 and so n dm/dP = n/4. The fourth column,
the discrepancy in the likelihood expression, is found
arithmetically row by row by dividing the product
of the values in columns one and three by the value
in column two. The fifth column, the contribution
to the amount of information, is obtained by squaring
the value in column three and dividing by that in
column two. The deviations of the two families
from the maximum likelihood solution are found by
summation in column four. It will be seen that the
two deviations are opposite in sign and almost equal,
so showing that the combined value of P has been
properly estimated. The amount of information
from each family is found by similar summation in
column five. Then the contribution of each family
to χ² is found by squaring the family total in column
four, the deviation from the maximum likelihood
solution, and dividing by the family amount of
information from column five. The two contributions
obtained in this way are 5.0959 and 2.7099,
giving a χ² of 7.8058 for one degree of freedom.
This is very close to the value obtained by the previous 
method of calculation. Thus the conclusion that the 
families do not agree in the values of P that they 
show may be arrived at conveniently and quickly 
by this method, which involves only the estimation 
of the best joint-statistic of P. 
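
The column-by-column recipe may be sketched in Python as follows. This is an illustration only: the counts are hypothetical (they are not the data of Table 23), the function names are our own, and the joint value of P is found by simple bisection rather than by the arithmetic approximation used in the text.

```python
def f2_terms(counts, P):
    """Deviation D, information I and D*D/I for one F2 family at a given P."""
    a1, a2, a3, a4 = counts
    n = a1 + a2 + a3 + a4
    # Column two: expectations n/4 * (2+P, 1-P, 1-P, P).
    expected = [n / 4 * (2 + P), n / 4 * (1 - P), n / 4 * (1 - P), n / 4 * P]
    # Column three: differentials of the expectations with respect to P.
    slopes = [n / 4, -n / 4, -n / 4, n / 4]
    D = sum(a * s / m for a, s, m in zip(counts, slopes, expected))  # column four
    I = sum(s * s / m for s, m in zip(slopes, expected))             # column five
    return D, I, D * D / I


def joint_P(families, lo=1e-9, hi=1 - 1e-9):
    """Bisect for the value of P at which the summed deviations vanish."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if sum(f2_terms(f, mid)[0] for f in families) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical counts (a1, a2, a3, a4) for two families.
families = [(190, 35, 38, 57), (420, 120, 115, 65)]
P_hat = joint_P(families)
terms = [f2_terms(f, P_hat) for f in families]
chi2_heterogeneity = sum(t[2] for t in terms)  # (number of families - 1) d.f.
```

At the joint estimate the family deviations sum to zero, which is the check on the estimation of P described above.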

Ex. 14. As a more complex example of the use of 
combined estimation in arriving at joint estimates 
and in testing heterogeneity we may consider the 
data of Jenkins (1927) on recombination between 
the factors Y,y and Wx,wx in maize. This author 
has three types of family which give information 
about the recombination between these genes, as 
shown in Table 25. 

The expectations for the various classes in each 
family are shown underneath the observed 
numbers. 

We are concerned to know whether the two single
factor ratios are in keeping with Mendelian expectation,
what the best estimate of the recombination
fraction is, and whether the data are homogeneous
for the recombination fraction.

TABLE 25

Cross                    Y Wx        Y wx        y Wx        y wx      Total

Backcross coupling .      397         297         289         412      1,395
                      n/2 (1-p)     n/2 p       n/2 p     n/2 (1-p)

Backcross repulsion .      78         136         120          80        414
                        n/2 p     n/2 (1-p)   n/2 (1-p)     n/2 p

Single backcross
repulsion .               461         161         515         130      1,267
                      n/4 (1+p)   n/4 (1-p)   n/4 (2-p)     n/4 p

In each family, segregation for the Y,y factor is
that of a backcross with expectation of 1 : 1. The
analysis of this segregation is then easy and is done
by the methods developed in Chapter II. The
following analysis of $\chi^2$ is obtained :

                    $\chi^2$    D.F.        P
Deviation . .       0.0832       1     0.80 - 0.70
Heterogeneity . .   0.8428       2     0.70 - 0.50
Total . . .         0.9260       3
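
The figures of this analysis can be reproduced from the class totals of Table 25. The sketch below is in Python; the dictionary keys and function name are our own labels, and the $\chi^2$ for a 1 : 1 expectation is taken in the usual form $(a - b)^2/(a + b)$.

```python
# (Y, y) totals for the three families of Table 25.
families = {
    "backcross coupling": (397 + 297, 289 + 412),
    "backcross repulsion": (78 + 136, 120 + 80),
    "single backcross": (461 + 161, 515 + 130),
}


def chi2_one_to_one(a, b):
    """Chi-square (1 d.f.) for agreement of a : b with a 1 : 1 expectation."""
    return (a - b) ** 2 / (a + b)


# Sum of the three family values (3 d.f.).
chi2_total = sum(chi2_one_to_one(a, b) for a, b in families.values())

# Deviation of the pooled segregation from 1 : 1 (1 d.f.).
pooled_Y = sum(a for a, _ in families.values())
pooled_y = sum(b for _, b in families.values())
chi2_deviation = chi2_one_to_one(pooled_Y, pooled_y)

# Remainder for heterogeneity between families (2 d.f.).
chi2_heterogeneity = chi2_total - chi2_deviation
```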

The case of the factor Wx,wx is not, however, so
simple. The first two sets of data are both backcrosses
for this factor and may be expected to show
a 1 : 1 segregation. The third set of data is, on the
other hand, an F2 for this factor and will show a
3 : 1 segregation. How may the joint deviation
from expectation be tested ?

A method for doing this test has been developed
by Mather (1937). It is based on joint estimation
by the method of maximum likelihood.

Let the frequency of gametes carrying the recessive
allelomorph, wx, be x. Then the gametic output
of a heterozygous, Wx,wx, plant will be Wx
$1 - x$ : wx $x$. On backcrossing, the observed segregation
will be the same as the gametic segregation of
the heterozygote. On selfing a heterozygote we shall
obtain a segregation of Wx $1 - x^2$ : wx $x^2$. The
observed frequencies are in the backcrosses jointly,
Wx 884 : wx 925 and in the F2 Wx 976 : wx 291.
The joint logarithm likelihood expression is

$$L = 884 \log (1 - x) + 925 \log x + 976 \log (1 - x^2) + 291 \log x^2$$

and the equation for the joint estimation of x is

$$\frac{dL}{dx} = -\frac{884}{1 - x} + \frac{925}{x} - \frac{976 \times 2x}{1 - x^2} + \frac{291 \times 2x}{x^2} = 0$$

We are, however, at present solely concerned with the
significance of the deviation of the segregations from
the simple Mendelian expectation of $x = \frac{1}{2}$. This
value for x may be substituted in the maximum
likelihood equation and the deviation of the data
from the maximum likelihood equation obtained.
It is found to be

$$2(925 - 884) - \tfrac{4}{3}(976 - 3 \times 291)$$

i.e. - 55.3.

We next require the amount of information about
x yielded jointly by the joint data. The methods of
Chapter VI lead us to expect that the backcross will
give $\frac{n}{x(1 - x)}$ units of information about the value
of x and that the F2 will similarly give $\frac{4n}{1 - x^2}$ units
of information. Then substituting the expected
value of $x = \frac{1}{2}$ the total information becomes
$4(884 + 925) + 5\tfrac{1}{3}(976 + 291)$, i.e. 13993.3. Thus
we now have both the deviation of the maximum
likelihood expression from the expected answer and
the amount of information about the parameter x.
We can calculate a $\chi^2$ value from these results by the
use of the formula $\chi^2 = \frac{D^2}{I}$, where D is the joint
deviation from zero and I the joint information.
This is found to give $\chi^2 = 0.2188$ and will clearly
have one degree of freedom, as its value can be made
indefinitely small by the adjustment of statistic x.
(N.B. This is a deviation, not a heterogeneity, $\chi^2$
as we have not used an estimated value of x.)
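
The deviation, the information and the resulting $\chi^2$ may be verified by a few lines of Python. The function names are our own; the score and information expressions are those written out above, evaluated at the Mendelian value x = 1/2.

```python
# Joint counts: backcrosses (Wx, wx) and F2 (Wx, wx).
bc_Wx, bc_wx = 884, 925
f2_Wx, f2_wx = 976, 291


def score(x):
    """dL/dx for the joint likelihood of the backcross and F2 data."""
    return (-bc_Wx / (1 - x) + bc_wx / x
            - f2_Wx * 2 * x / (1 - x * x) + f2_wx * 2 * x / (x * x))


def information(x):
    """n/(x(1-x)) from the backcrosses plus 4n/(1-x^2) from the F2."""
    return (bc_Wx + bc_wx) / (x * (1 - x)) + 4 * (f2_Wx + f2_wx) / (1 - x * x)


D = score(0.5)        # deviation of the likelihood equation from zero
I = information(0.5)  # joint amount of information about x
chi2 = D * D / I      # deviation chi-square, one degree of freedom
```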

The analysis of $\chi^2$ for this factor may now be completed
in the usual manner. Each family provides
a value of $\chi^2$ calculated by the formulae given in
Chapter II. The sum of these three $\chi^2$s is found
to be 3.9529, for, of course, three degrees of freedom.
From this total the $\chi^2$ for the joint deviation from
the Mendelian expectation as calculated above may
be deducted to leave a remainder for heterogeneity
between the families. The complete analysis is :

                    $\chi^2$    D.F.        P
Deviation . .       0.2188       1     0.70 - 0.50
Heterogeneity . .   3.7341       2     0.20 - 0.10
Total . . .         3.9529       3

It is clear from the $\chi^2$ analyses for the two single
factor ratios that these segregations are quite in
keeping with simple expectation and also fail
to show any signs of inhomogeneity. So we may
proceed to the estimation of the recombination
fraction.

Table 25 gives the observed segregations and also
the expectations of the various classes in terms of
the recombination fraction p. It is then easy to
write down the joint logarithm likelihood expression
for the three sets of data. In each set of data those
classes with the same expectations in terms of p are
added together.

$$L = 809 \log (1 - p) + 586 \log p + 256 \log (1 - p) + 158 \log p + 461 \log (1 + p) + 161 \log (1 - p) + 515 \log (2 - p) + 130 \log p$$

The sources of the terms of this expression are
obvious. The first two are from the coupling backcross,
the second pair from the repulsion backcross
and the last four terms from the single backcross.
Maximizing by differentiation with respect to p
gives as the equation of estimation :

$$\frac{dL}{dp} = -\frac{809}{1 - p} + \frac{586}{p} - \frac{256}{1 - p} + \frac{158}{p} + \frac{461}{1 + p} - \frac{161}{1 - p} - \frac{515}{2 - p} + \frac{130}{p} = 0$$

This equation is most easily solved by arithmetic
approximation, as in Ex. 12. By this means it is
found that the value of p is 0.4162 to four decimal
places (Table 26).
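
The solution may be checked numerically. The Python sketch below uses plain bisection in place of the tabular approximation of Ex. 12; this substitution is ours, made only because bisection is short to write down, and the function names are likewise our own.

```python
def score(p):
    """dL/dp for the joint likelihood of the three sets of data."""
    return (-809 / (1 - p) + 586 / p      # coupling backcross
            - 256 / (1 - p) + 158 / p     # repulsion backcross
            + 461 / (1 + p) - 161 / (1 - p)
            - 515 / (2 - p) + 130 / p)    # single backcross


def solve(lo=0.01, hi=0.99):
    """Every term of the score falls as p rises, so there is one root to bisect for."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


p_hat = solve()  # agrees with 0.4162 to four decimal places
```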

There is now left the question of heterogeneity of
the linkage data. The heterogeneity is easily calculated
from the figures already obtained in solving
the equation of estimation of p. We first of all
substitute 0.4162 for p in the equation of estimation.
This enables us to find the deviations (D) from zero
of the three separate parts of this equation, i.e., those
parts coming from the three different sets of data,
when the best joint estimate of p is used. We next
calculate the amounts of information (I) about p
yielded by the three separate sets of data. This is
done arithmetically by determining the rate of
change of the corresponding portions of the maximum
likelihood expression for unit change in p in the
neighbourhood of the parameter's best fitting value
(cf. Ex. 12). For example, in the coupling backcross

D = 1407.97693 - 1385.74854 = 22.22839 and
I = [(1408.65385 - 1385.27397) - (1405.27578 - 1387.65009)] x 1000 = 5754.19

(see Table 26). $\chi^2$ may then be calculated for each
set of data from the formula $\chi^2 = \frac{D^2}{I}$. The
results are given in Table 27.

TABLE 27

                              D            I       $\chi^2$
Backcross coupling .       22.22839     5754.19    0.0859
Backcross repulsion .    - 58.88116     1662.71    2.0851
Single backcross .         36.92212     1657.41    0.8225
Total .                     0.26935     9074.31    2.9935

The total of the $\chi^2$ values is 2.9935. This will correspond
to two degrees of freedom because, of the
three provided by the three families, one is used up
in calculating the best fitting value of p. The probability
of obtaining as large or larger value of $\chi^2$ by
chance is 0.30 - 0.20 and so the data may be considered
to be homogeneous.

Thus the three families all agree in showing good
single factor segregations and a recombination value
of 41.62 ± 1.05 per cent.
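
The entries of Table 27 can be approximated in Python. One departure from the text should be noted: the information is here taken from the analytic second derivative of each part of the likelihood, not from the arithmetic differencing of Table 26, so the values of I (and hence of $\chi^2$) differ slightly in the later figures. The data layout and names are our own.

```python
import math

# Each class is (count, c, s): its expectation is proportional to c + s*p.
families = {
    "backcross coupling": [(809, 1, -1), (586, 0, 1)],
    "backcross repulsion": [(256, 1, -1), (158, 0, 1)],
    "single backcross": [(461, 1, 1), (161, 1, -1), (515, 2, -1), (130, 0, 1)],
}
p = 0.4162  # joint estimate quoted in the text


def deviation(classes, p):
    """That family's part of dL/dp."""
    return sum(a * s / (c + s * p) for a, c, s in classes)


def information(classes, p):
    """Minus the second derivative of that family's part of L."""
    return sum(a / (c + s * p) ** 2 for a, c, s in classes)


results = {}
for name, classes in families.items():
    D, I = deviation(classes, p), information(classes, p)
    results[name] = (D, I, D * D / I)

chi2_total = sum(r[2] for r in results.values())  # two degrees of freedom
standard_error = 1 / math.sqrt(sum(r[1] for r in results.values()))
```

The standard error so found, multiplied by 100, gives the 1.05 per cent attached to the recombination value.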

19. INCOMPLETE MANIFESTATION OF A CHARACTER

The testing of heterogeneity by combined estimation
has another important application, viz. to the
testing of hypotheses involving segregations for a
character which shows incomplete manifestation.
In such cases it is usual to grow F3 progenies to test
certain of the F2 individuals for the presence of the
character which may not have been shown phenotypically.
The problem is that of incorporating such
test progenies in the analysis. The solution of this
problem has been described by Smith (1937).

Ex. 15. The actual example used by Smith is that
of 'fired' in certain wheat varieties. This character
is apparently controlled by three unlinked complementary
genes, i.e. ABC plants will be fired, but
ABc, AbC, aBC, Abc, aBc, abC, and abc plants
will be normal. Then in the F2 raised from triple
heterozygotes, the segregation expected is one of 27
fired to 37 normal. Actually it is found that fired
plants may sometimes look normal, and so a number
of normal-looking F2 individuals were tested by growing
F3 progenies from them. If the F2, though genotypically
fired, was phenotypically normal, this will
be betrayed by the occurrence of fired individuals in
the F3. Genotypically normal individuals will fail
to give fired members in the F3.

The actual segregations observed were, in the F2
raised from triple heterozygotes, 161 fired : 276 normal.
Of these normals 60 were tested by growing
F3 progenies from them and of these 5 were found to
be genotypically fired and 55 genotypically normal.
Are these data in agreement with the above
hypothesis ?

If the hypothesis is true we expect 27/64 of the
F2 to be genotypically fired, but of these some portion
will look normal. Let us represent this portion by f.
Then we expect $(27 - f)/64$ to be phenotypically
fired and $(37 + f)/64$ to be phenotypically normal.
On testing normals we expect to find $\frac{f}{37 + f}$ of them
to be genotypically fired and $\frac{37}{37 + f}$ of them to be
true normals. We thus have two sets of data, each giving
an estimate of f in accordance with our hypothesis.
If the hypothesis is true we expect that these two
estimates of f will be alike. If the data are heterogeneous
with respect to the f of the hypothesis, then
the hypothesis is wrong. The problem is one of
testing heterogeneity by joint estimation.

The joint logarithm likelihood of the two sets of
data is :

$$L = 161 \log \frac{27 - f}{64} + 276 \log \frac{37 + f}{64} + 5 \log \frac{f}{37 + f} + 55 \log \frac{37}{37 + f}$$

On summing terms involving like expressions for f
and omitting terms independent of f this becomes :

$$L = 161 \log (27 - f) + 216 \log (37 + f) + 5 \log f$$

Differentiating and equating to zero gives

$$\frac{dL}{df} = -\frac{161}{27 - f} + \frac{216}{37 + f} + \frac{5}{f} = 0$$

for the equation of estimation for f.
This reduces to

$$382f^2 + 175f - 4{,}995 = 0$$

and gives f = 3.394254.

We may now find the expectations for the classes
in the F2 and F3 using this best fitting value of f.
These are set out in Table 28.

TABLE 28

Class         Observed (a)    Expected (mn)    $(a - mn)^2/mn$

Fired . .         161            161.1830          0.0002
Normal . .        276            275.8170          0.0001
Total . .         437            437

Fired . .           5              5.0417          0.0003
Normal . .         55             54.9583          0.0001
Total . .          60             60               0.0007

A $\chi^2$ is calculated for the agreement of the observed
values with these expectations. It will have one
degree of freedom as one has been lost in fitting f.
The probability of obtaining as large or larger deviation
is then found to be 0.98 to 0.95. The data are
in agreement with the hypothesis.
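
The fitting of f and the expectations of Table 28 may be followed in Python. The sketch simply takes the positive root of the quadratic given above; the variable names are our own.

```python
import math

# Positive root of 382 f^2 + 175 f - 4995 = 0.
a, b, c = 382.0, 175.0, -4995.0
f = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)

observed = [161, 276, 5, 55]   # F2 fired, F2 normal, F3 fired, F3 normal
expected = [
    437 * (27 - f) / 64,       # F2 phenotypically fired
    437 * (37 + f) / 64,       # F2 phenotypically normal
    60 * f / (37 + f),         # tested normals proving genotypically fired
    60 * 37 / (37 + f),        # tested normals proving true normal
]

# Chi-square for agreement; one degree of freedom after fitting f.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```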



CHAPTER VIII 
DISTURBED SEGREGATIONS 
20. DISTURBED F2S

SIMPLE formulae have been given in earlier
chapters for the detection and estimation of
linkage. It is, however, clear that they are only
accurate when the single factor segregations are good.
The formulae for calculating $\chi^2$ in the detection of
linkage are special to given cases ; a change in the
segregation of one factor necessitates an entirely new
formula. Also the simple application of the method
of maximum likelihood will not give an estimate free
from error arising out of disturbed single factor
ratios. A considerably more complicated use of
maximum likelihood could be employed for such cases.

There are, however, methods for the detection and 
estimation of linkage with disturbed single factor 
segregations which do give accurate and trustworthy 
results. These methods do not enjoy some of the 
advantages of the methods applicable when gene 
segregation is normal, but are to be preferred to the 
earlier methods when disturbances are encountered. 
The methods for the detection of linkage, as given in 
this chapter, are of wider application than the 
methods for the estimation of linkage. 

We may conveniently divide the treatment of dis- 
turbed data into two types, that concerned with 
single families and that to be used when both coupling 
and repulsion data are to hand. 

Ex. 16. As an example of the analysis of data of
but one type we may take a case of linkage between
a gene and a second one which is a member of a pair
of complementary factors. Then the character controlled
by the second linked gene will not segregate
in a 3 : 1 ratio but will give a 9 : 7 in the F2. As we
know what the true segregation is we can compare
our approximate results, obtained by treating the
9 : 7 as a disturbed 3 : 1, with the true values obtained
by using the 9 : 7 expectation. We may utilize data
from Jenkins (1927) concerning the segregation for
green and yellow plant colour and purple and white
aleurone colour in maize. The former is controlled
by a single factor and the latter by a pair of complementary
factors. In one family segregating for all
the factors the segregation observed was :

Green Purple    Yellow Purple    Green White    Yellow White
     127              19              67              44

We desire to know if there is linkage between the
genes, assuming that the purple-white segregation is
due to a single factor disturbed by unknown causes,
and if linkage is found, what the recombination
value is.

We first note that the segregation for green : yellow
is 194 : 63 and that this is a good 3 : 1 (as tested by
$\chi^2$). On the other hand, the segregation for purple :
white cannot be considered as agreeing with an
expectation of 3 : 1 when tested by $\chi^2$. Then we
cannot use the ordinary method for calculating $\chi^2$
to test for linkage. We may, however, calculate the
linkage $\chi^2$ in a different manner. The data are set
out for this purpose in a 2 x 2 table (contingency
table) thus :

TABLE 29

              Green    Yellow    Total
Purple .       127       19       146
White .         67       44       111
Total .        194       63       257

The marginal totals give the single factor segregations
of the two genes. If there is no linkage between
the genes we may expect that the frequency of any
one class observed in the experiment is proportional
to the corresponding marginal frequencies. Thus the
green purple class would be expected to occur in
$\frac{146}{257} \times \frac{194}{257}$ of cases. As there are 257 in the family
we then expect 110.2101 in this class. In this way
we could calculate the expectation for the other
classes and then find $\chi^2$ from the formula $S\left(\frac{a^2}{nm}\right) - n$.
This $\chi^2$ would be a test of departure from independence
of segregation of the genes, i.e. of linkage.

We can, however, calculate the $\chi^2$ in a much simpler
manner, viz. by using the formula

$$\chi^2 = \frac{(a_1a_4 - a_2a_3)^2\, n}{(a_1 + a_2)(a_3 + a_4)(a_2 + a_4)(a_1 + a_3)}$$

where $a_1$, $a_2$, &c. and n have the same meaning as in
earlier chapters. Applying this formula we find :

$$\chi^2 = \frac{(127 \times 44 - 67 \times 19)^2 \times 257}{146 \times 111 \times 63 \times 194} = 24.159$$

This corresponds to one degree of freedom as, of the
original three, two have been taken up in using the
observed marginal totals as the best fitting segregations
for the single factors. The significance of this
$\chi^2$ is beyond question.
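
The contingency calculation is short enough to set down in Python. The function name is our own; the cells are entered in the order green purple, yellow purple, green white, yellow white.

```python
def chi2_2x2(a1, a2, a3, a4):
    """Chi-square (1 d.f.) for independence in a 2 x 2 table.

    a1, a2 are the cells of the first row and a3, a4 those of the second.
    """
    n = a1 + a2 + a3 + a4
    numerator = (a1 * a4 - a2 * a3) ** 2 * n
    denominator = (a1 + a2) * (a3 + a4) * (a2 + a4) * (a1 + a3)
    return numerator / denominator


# Jenkins' family: green purple, yellow purple, green white, yellow white.
chi2_linkage = chi2_2x2(127, 19, 67, 44)

# Expectation of the green purple class from the marginal totals.
expected_green_purple = 146 * 194 / 257
```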

As a check on the method we may calculate the $\chi^2$
for linkage, utilizing the knowledge that the purple-white
segregation is one of 9 : 7. The formula for
the $\chi^2$ detecting linkage in such a family, arrived at
by the methods of Chapter VI, is

$$\chi^2 = \frac{(7a_1 - 9a_2 - 21a_3 + 27a_4)^2}{189n}$$

and in this case $\chi^2 = 23.792$. The difference between
the linkage $\chi^2$s as calculated in these two ways is
negligible. On the other hand, if we calculate $\chi^2$
from the formula $\chi^2 = \frac{(a_1 - 3a_2 - 3a_3 + 9a_4)^2}{9n}$, which is
correct when both factors are giving a 3 : 1 ratio, we
find that it is 30.361, a misleadingly high value. In
the present case the overestimation of the significance
of the evidence for linkage is not serious, but it is
easy to see that it could be highly misleading in cases
where the evidence is more doubtful.
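
Both values may be reproduced in Python. One assumption must be stated plainly: in these two formulae $a_2$ is taken as the green white class and $a_3$ as the yellow purple class, since this is the assignment under which the coefficients follow the 3 : 1 and 9 : 7 ratios and under which the figures of the text are recovered.

```python
# a1 = green purple, a2 = green white, a3 = yellow purple, a4 = yellow white.
# (Assumed class order; see the note above.)
a1, a2, a3, a4 = 127, 67, 19, 44
n = a1 + a2 + a3 + a4

# Linkage chi-square when plant colour is 3 : 1 and aleurone colour 9 : 7.
chi2_nine_seven = (7 * a1 - 9 * a2 - 21 * a3 + 27 * a4) ** 2 / (189 * n)

# Linkage chi-square wrongly assuming both factors give 3 : 1.
chi2_three_one = (a1 - 3 * a2 - 3 * a3 + 9 * a4) ** 2 / (9 * n)
```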

We may now consider the estimation of linkage.
In place of the method of maximum likelihood, which
cannot be used when the single factor segregation is
poor, we may employ the product method. This
method will in certain cases give an absolutely
accurate estimate of linkage and in any case will
reduce the error arising from the poor segregation as
compared with the usual maximum likelihood
formula for undisturbed F2s.

The product formula puts $\frac{a_1a_4}{a_2a_3} = \frac{2P + P^2}{1 - 2P + P^2}$ for
the usual two factor F2. Applying it to the present
case we find

$$\frac{5588}{1273} = \frac{2P + P^2}{1 - 2P + P^2}$$

or $4315P^2 - 13722P + 5588 = 0$.

Then P = 0.479542
and $p = 1 - \sqrt{P} = 0.3075$.

This value of p is rather far from the true value
of p as calculated when the 9 : 7 is treated as such
and not as a bad 3 : 1. It is, however, considerably
nearer to the correct value than is the estimate
reached if the data are treated as a simple two factor
F2 by maximum likelihood (35 per cent). Thus
although the error is not completely removed it is
reduced.
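
The product estimate may be computed directly. In the Python sketch below the quadratic in P is solved in closed form: writing r for the ratio $a_1a_4/a_2a_3$, the equation $(1 - r)P^2 + 2(1 + r)P - r = 0$ has the admissible root $P = r/(1 + r + \sqrt{1 + 3r})$. This rearrangement and the function name are ours.

```python
import math


def product_estimate_P(a1, a2, a3, a4):
    """Solve a1*a4/(a2*a3) = (2P + P^2)/(1 - 2P + P^2) for P in (0, 1)."""
    r = (a1 * a4) / (a2 * a3)
    return r / (1 + r + math.sqrt(1 + 3 * r))


# Jenkins' family; the order of a2 and a3 does not affect the product.
P = product_estimate_P(127, 19, 67, 44)
p = 1 - math.sqrt(P)  # coupling data, so that P = (1 - p)^2
```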

This example is, however, not one which shows the
product formula at its best. Let us take an artificial
example where the ratios are disturbed by poor
manifestation of the recessive. We may then compare
the estimates of the recombination fraction
with the value used as the basis for the construction
of the example.

If we take an F2 for two linked factors with 25 per
cent recombination in repulsion we expect a segregation
of 2.0625 AB : 0.9375 Ab : 0.9375 aB : 0.0625 ab.
Let us further consider that the recessive bb failed
to manifest itself in 40 per cent of cases. Then 40
per cent of the Ab and ab classes will be classified
as AB and aB respectively. Then we expect to find
an observable segregation of 2.4375 AB : 0.5625 Ab :
0.9625 aB : 0.0375 ab. Calculation of the recombination
fraction from these figures by the simple
maximum likelihood F2 formula (Chapter V) gives
P = 0.077972, $p = \sqrt{P} = 0.2792$. This is a deviation
of 2.92 per cent from the real value of 25 per cent.

If, on the other hand, we calculate P, and from
it p, by the product method we obtain a much better
result. The equation of estimation (see above) is

$$\frac{0.09140625}{0.54140625} = \frac{2P + P^2}{1 - 2P + P^2}$$

and we find P = 0.070457
p = 0.2654

The deviation is now but half of that shown by the
maximum likelihood estimate. This well illustrates
the effect of the product method in minimizing such
errors. If the disturbance had been due solely to
poor viability of the bb classes, the product formula
would have given an absolutely correct estimate.
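
The artificial example can be rebuilt from its stated assumptions in Python. The maximum likelihood value is here obtained from the quadratic $nP^2 - (a_1 - 2a_2 - 2a_3 - a_4)P - 2a_4 = 0$, which follows from equating $\frac{a_1}{2+P} - \frac{a_2+a_3}{1-P} + \frac{a_4}{P}$ to zero; we assume, but cannot confirm from this section, that this is the Chapter V formula referred to. All names are our own.

```python
import math


def product_P(a1, a2, a3, a4):
    """Product-formula estimate of P (closed-form root of the quadratic)."""
    r = (a1 * a4) / (a2 * a3)
    return r / (1 + r + math.sqrt(1 + 3 * r))


def ml_P(a1, a2, a3, a4):
    """Maximum likelihood estimate of P for an undisturbed F2."""
    n = a1 + a2 + a3 + a4
    k = a1 - 2 * a2 - 2 * a3 - a4
    return (k + math.sqrt(k * k + 8 * n * a4)) / (2 * n)


P_true = 0.25 ** 2                                # repulsion: P = p^2
AB, Ab, aB, ab = 2 + P_true, 1 - P_true, 1 - P_true, P_true

# 40 per cent of bb individuals fail to manifest and are scored as B.
miss = 0.40
seen = (AB + miss * Ab, (1 - miss) * Ab, aB + miss * ab, (1 - miss) * ab)

p_ml = math.sqrt(ml_P(*seen))
p_product = math.sqrt(product_P(*seen))

# Disturbance by viability alone: bb classes survive at a rate of 60 per cent.
thinned = (AB, 0.6 * Ab, aB, 0.6 * ab)
p_viability = math.sqrt(product_P(*thinned))      # recovers 0.25
```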

It should be remembered that when we say that 
the method of maximum likelihood gives an estimate 
showing considerable error we do not mean that the 
correct application of maximum likelihood will give 
a wrong estimate. If, in setting up the expected
values, we can, by utilization of some hypothesis as
to their nature, allow for the disturbances in the
segregation, a correct estimate would be obtained.
It is in cases where such a knowledge of the cause of 
disturbance is not possessed, so necessitating the 
employment of an approximate method, that the 
product formula is of more value. 

We may note finally that the product formula is 
fully efficient for the estimation of linkage and its 
variance is as small as that of the maximum likelihood 
estimate. These may be calculated by the methods 
of Chapters V and VI or directly from the product 
formula as shown at the end of this chapter. 

21. THE EXACT TREATMENT OF BACKCROSS DATA

In the detection and estimation of the recom- 
bination fraction it has been supposed above that 
the disturbance in the segregation of one factor has 
not affected the segregation of the other. This may 
not be true in all cases, though in this example the 
good segregation of the first factor suggests that this 
assumption is not incorrect. In general, failure of 
manifestation of one character, due for example to 
incomplete penetrance, will give results justifying 
the use of this method. On the other hand, reduced 
viability of one factor will affect the segregation of 
anything linked to it and so the method may not be 
completely suitable. 

Where both factors are showing disturbed segrega- 
tion this method must be used with considerable 
caution. If one class is very short in numbers, so 
causing the disturbed segregations of both factors, it 
cannot be treated in this way. 

The second approach to the detection and estimation
of linkage where gene ratios are disturbed, as developed
by Fisher (1936b), is beyond these criticisms and
is frequently applicable. It demands, however, the
joint use of coupling and repulsion data. This second
method has been developed for the backcross only.
It is not clear that it can be used uncritically for F2
data, as they do not share with the backcross the
characteristic of each phenotypic class comprising
but one genotypic class. In this case the corrections
for viability may not be absolutely correct. It would,
however, certainly give an exact test for the detection
and a first approximation to the correct treatment
in estimation of disturbed F2 data.

Ex. 17. Nabours (et al. 1933 and unpublished data
supplied to R. A. Fisher) finds the following segregations
on backcrossing Grouse Locusts (Acridium
arenosum), heterozygous for the two factors W and
My, to the double recessive.

TABLE 30

                w my    W my    w My    W My    Total
Repulsion .      30      70       2      24      126
Coupling .      519     119      12     349      999

It will be seen that in both cases the w My class
is very small as compared with any other. Neither
W,w nor My,my is giving a good 1 : 1 segregation
such as would be expected from a backcross. This
is largely attributable to the shortage of w My
animals, though it is to some extent aggravated by
a small shortage of W My locusts. It is, however,
clear that the disturbance of each factor is due to its
interaction in viability with the other factor. Hence
the methods of the previous example cannot be used.
We must proceed by comparison of the two families.

First let us add the w my and W My animals
together, calling the sum $A_1$. Similarly the W my
and w My animals are summed to give $A_2$. This is
done for the coupling and repulsion data separately.

Then in repulsion the $A_1$ class is one of recombination
individuals, and $A_2$ comprises the parental
combinations. In the coupling data the reverse is, of
course, the case. These classes should be potentially
equal in the absence of linkage, though viability
disturbance may reduce one or other of them. No
matter what the cause of the disturbance it should
affect the $A_1$ class in coupling and the $A_1$ class of repulsion
equally, as they comprise the same two genotypes in
potentially equal numbers. Then in the absence of
linkage the ratio of $A_1$ to $A_2$ should be the same in
both sets of data irrespective of any viability
disturbance. This expected similarity provides the
basis for the detection of linkage. The data are set
out as in Table 31.

TABLE 31

                $A_2$    $A_1$    Total
Repulsion .      72       54       126
Coupling .      131      868       999
Total .         203      922      1125

The marginal totals are found and then a $\chi^2$ testing
the hypothesis that the observed four classes are
proportional to the marginal totals, i.e. that the
$A_1$-$A_2$ subdivision is independent of the coupling-repulsion
subdivision (as it will be in the absence of
linkage) may be calculated. It is done by the same
formula used for the table in the last example.
We find

$$\chi^2 = \frac{(868 \times 72 - 131 \times 54)^2 \times 1125}{126 \times 999 \times 203 \times 922} = 146.674$$

for one degree of freedom. This is clearly of very
great significance and there can be no doubt of the
existence of linkage.

We next ask the value of the recombination fraction
between the genes.

Now, if we imagine two hypothetical values $\alpha_1$, $\alpha_2$
of the type of $A_1$ and $A_2$ but undisturbed by viability
differences it is clear that

$$\frac{p}{1 - p} = \frac{\alpha_1}{\alpha_2} \quad \text{for repulsion}$$

and

$$\frac{p}{1 - p} = \frac{\alpha_2}{\alpha_1} \quad \text{for coupling}$$

But the ratio $\frac{A_1}{A_2}$ departs from the ratio $\frac{\alpha_1}{\alpha_2}$ because of
viability troubles. The departure from this perfect
ratio, caused by the inviability of some genotypes,
is the same in the coupling and repulsion data. We
can thus write :

$$\frac{A_1}{A_2} = \frac{v\alpha_1}{\alpha_2}$$

Then for repulsion

$$\frac{p}{1 - p} = \frac{A_{R1}}{vA_{R2}}$$

and for coupling

$$\frac{p}{1 - p} = \frac{vA_{C2}}{A_{C1}}$$

Hence simple multiplication gives

$$\left(\frac{p}{1 - p}\right)^2 = \frac{A_{R1}A_{C2}}{A_{R2}A_{C1}}$$

where $A_{R1}$ is $A_1$ in the case of repulsion, &c. This
equation provides a method of estimating p independently
of the viability disturbance v. It may be
noted here that the equation

$$\frac{A_{R1}A_{C1}}{A_{R2}A_{C2}} = v^2$$

itself also derived by the above considerations, provides
an estimate of v independently of p. The
method may be used for either purpose.

Applying this method to the estimation of p from
Nabours' data we find

$$\frac{p^2}{(1 - p)^2} = \frac{54 \times 131}{72 \times 868} = 0.113191$$

Then $\frac{p}{1 - p} = 0.33644$

and $p = \frac{0.33644}{1.33644} = 0.2517$ or 25.17 per cent.
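
The two estimates, of p and of v, follow from the four class totals of Table 31. A brief Python check, with variable names of our own choosing:

```python
import math

# Class totals of Table 31.
A_R1, A_R2 = 54, 72     # repulsion: recombinants, parentals
A_C1, A_C2 = 868, 131   # coupling: parentals, recombinants

# (p / (1 - p))^2 is free of the viability disturbance v.
odds_squared = (A_R1 * A_C2) / (A_R2 * A_C1)
odds = math.sqrt(odds_squared)
p = odds / (1 + odds)

# v^2 is in turn free of p.
v = math.sqrt((A_R1 * A_C1) / (A_R2 * A_C2))
```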


Our next concern is to find an estimate of the
variance of the statistic p. There exists a simple
formula for this purpose (cf. Fisher, 1936b). The
derivation of this formula is given in the last section
of this chapter. The formula itself is

$$V_p = \frac{p^2(1 - p)^2}{h}$$

where h is the harmonic mean of $A_{R1}$, $A_{R2}$, $A_{C1}$, and
$A_{C2}$, i.e.

$$\frac{1}{h} = \frac{1}{4}\left(\frac{1}{A_{R1}} + \frac{1}{A_{R2}} + \frac{1}{A_{C1}} + \frac{1}{A_{C2}}\right)$$

In the present case we find h = 97.102 and
p = 0.2517.

Then $V_p = \frac{(0.2517)^2 (0.7483)^2}{97.102} = 0.0003653$
and $s_p = \sqrt{V_p} = 0.01911$.
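
The harmonic mean and the variance may be checked as follows; the value p = 0.2517 is taken from the text and the names are our own. Unrounded arithmetic gives h a few units different in the third decimal place from the figure quoted, without effect on the variance to the accuracy shown.

```python
import math

classes = [54, 72, 868, 131]   # A_R1, A_R2, A_C1, A_C2
p = 0.2517

# Harmonic mean: 1/h is the average of the reciprocals.
h = len(classes) / sum(1 / a for a in classes)

V_p = p ** 2 * (1 - p) ** 2 / h
s_p = math.sqrt(V_p)
```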


22. THE CALCULATION OF VARIANCE FORMULAE 


In the above examples certain assertions were
made about the formulae for the variances of the
statistics used. The methods by which such variances
are obtained may be illustrated by the derivation
of the two formulae used in the last two examples.
First consider the question of the variance of the
statistic P calculated from an F2 by the product
formula method (Fisher 1936a). This formula puts

$$\frac{a_1a_4}{a_2a_3} = \frac{2P + P^2}{1 - 2P + P^2}$$

Then, taking logarithms,

$$F = \log a_1 + \log a_4 - \log a_2 - \log a_3 = \log (2 + P) + \log P - 2 \log (1 - P)$$

The variance of F may be found from the general
formula

$$V_F = S\left[m\left(\frac{dF}{da}\right)^2\right] - n\left(\frac{dF}{dn}\right)^2$$

where m is the expectation of a.

Now $\frac{dF}{da_1} = \frac{1}{a_1}$, $\frac{dF}{da_2} = -\frac{1}{a_2}$, &c., and $\frac{dF}{dn} = 0$.

Then substituting expectation for $a_1$, &c., we find

$$V_F = \frac{\frac{n}{4}(2 + P)}{\frac{n^2}{16}(2 + P)^2} + \frac{\frac{n}{4}P}{\frac{n^2}{16}P^2} + \frac{2 \cdot \frac{n}{4}(1 - P)}{\frac{n^2}{16}(1 - P)^2}$$

or

$$\frac{n}{4}V_F = \frac{1}{2 + P} + \frac{1}{P} + \frac{2}{1 - P} = \frac{2(1 + 2P)}{P(2 + P)(1 - P)}$$

To obtain $V_P$ from $V_F$ we use the general formula
$V_P = V_F \Big/ \left(\frac{dF}{dP}\right)^2$, already used in Chapter V.

$$\frac{dF}{dP} = \frac{1}{2 + P} + \frac{1}{P} + \frac{2}{1 - P} = \frac{2(1 + 2P)}{P(2 + P)(1 - P)}$$

Then

$$\frac{n}{4}V_P = \frac{P(2 + P)(1 - P)}{2(1 + 2P)}$$

$$V_P = \frac{2P(2 + P)(1 - P)}{n(1 + 2P)}$$

This is also the formula for the variance of the maximum
likelihood statistic, and so the product formula
gives a fully efficient estimate of P.
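
The agreement of the two variances can be confirmed numerically. The Python sketch below computes the variance by the route of the product formula, then independently as the inverse of the amount of information in an F2 (the sum of the squared differentials of the expectations divided by the expectations), and compares both with the closed form. The function names and the trial values of P and n are our own.

```python
def variance_product(P, n):
    """V_P by way of V_F and dF/dP, as in the derivation above."""
    bracket = 1 / (2 + P) + 1 / P + 2 / (1 - P)
    V_F = (4 / n) * bracket   # sum of reciprocals of the expected numbers
    dF_dP = bracket
    return V_F / dF_dP ** 2


def variance_ml(P, n):
    """Inverse of the information yielded by an F2 of n individuals."""
    expected = [n / 4 * (2 + P), n / 4 * (1 - P), n / 4 * (1 - P), n / 4 * P]
    slopes = [n / 4, -n / 4, -n / 4, n / 4]
    information = sum(s * s / m for s, m in zip(slopes, expected))
    return 1 / information


def variance_closed(P, n):
    return 2 * P * (2 + P) * (1 - P) / (n * (1 + 2 * P))


checks = [(P, variance_product(P, 400), variance_ml(P, 400), variance_closed(P, 400))
          for P in (0.05, 0.25, 0.6)]
```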

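The second example (Fisher 1936b) discussed in
this chapter utilized the estimation equation

$$\frac{p^2}{(1 - p)^2} = \frac{A_{R1}A_{C2}}{A_{R2}A_{C1}} = \frac{\alpha_{R1}\alpha_{C2}}{\alpha_{R2}\alpha_{C1}}$$

What is the variance of p ?

Using the same notation as in the example let us
consider the simple case $\frac{p}{q} = \frac{\alpha_1}{\alpha_2}$ where $q = 1 - p$.

$$\log \frac{p}{q} = \log p - \log q$$

and

$$\frac{d}{dp}\left(\log \frac{p}{q}\right) = \frac{1}{p} + \frac{1}{q} = \frac{1}{pq}$$

$I_x$, the amount of information about x, is the inverse of
the variance of x.

Then

$$I_{\log \frac{p}{q}} = I_p \Big/ \left(\frac{1}{pq}\right)^2 = \frac{n}{pq}\, p^2q^2 = npq$$

or putting $p = \frac{\alpha_1}{n}$ and $q = \frac{\alpha_2}{n}$,

$$I_{\log \frac{p}{q}} = \frac{\alpha_1\alpha_2}{n}$$

or

$$V_{\log \frac{p}{q}} = \frac{1}{\alpha_1} + \frac{1}{\alpha_2}$$

But we actually estimate $\frac{p}{q}$ as the geometric mean
of $\frac{A_{R1}}{A_{R2}}$ and $\frac{A_{C2}}{A_{C1}}$ (i.e. as the geometric mean of
$\frac{\alpha_{R1}}{\alpha_{R2}}$ and $\frac{\alpha_{C2}}{\alpha_{C1}}$).
Remembering that if

$$2 \log \frac{p}{q} = x_1 + x_2$$

then

$$V_{\log \frac{p}{q}} = \frac{1}{4}\left(V_{x_1} + V_{x_2}\right)$$

the variance of $\log \frac{p}{q}$ as estimated in this manner is
found by

$$V_{\log \frac{p}{q}} = \frac{1}{4}\left(\frac{1}{A_{R1}} + \frac{1}{A_{R2}} + \frac{1}{A_{C1}} + \frac{1}{A_{C2}}\right) = \frac{1}{h}$$

where h is the harmonic mean, because the variance
of a sum is the sum of the variances of the two components.
Then

$$V_p = \frac{1}{h} \Big/ \left(\frac{1}{pq}\right)^2 = \frac{p^2q^2}{h}$$

This is the formula utilized in the example.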



CHAPTER IX 
HUMAN GENETICS (I) 


23 . HUMAN DATA 

THERE are two characteristics of human genetical
data that make its statistical reduction
different from, and rather more complex than, that
applied to normal genetical results. These are 
respectively the small size of the families produced 
by any mating and the incomplete information avail- 
able about the type of mating. These difficulties are 
overcome by the development of a correspondingly 
more elaborate statistical technique. It should be 
noticed, however, that although the statistical treat- 
ment is superficially different from that described for 
non-human material, it is characterized by certain 
fundamental similarities to the methods developed in 
the previous chapters. The following description of 
the statistical methods applicable to human material 
is not intended to be complete, but, it is hoped, will 
serve as an indication of the general line of approach 
to these special problems. More detailed analyses 
will be found in the various articles cited in the text. 

The two main statistical problems, viz. single factor 
segregation and linkage, will be considered separately 
and in that order. 

24 . SINGLE FACTOR SEGREGATIONS 

The two difficulties noted above as characteristics
of human data are encountered immediately a consideration
of single factor segregation is undertaken.
In the first place the smallness of the families
invalidates the use of the $\chi^2$ test of deviation from
the expected segregation. Where expectation in any
class is less than 5, the $\chi^2$ calculated from that family
departs seriously from the tabulated large sample $\chi^2$
distribution. Hence it cannot be used as a test of
significance in such cases. This difficulty can, of
course, be overcome by suitable lumping in order to
test the significance of deviations, but in testing
heterogeneity the difficulty is felt with full force.
Until an easily applicable generalized $\chi^2$ is available,
this test is ruled out of general use.

The second characteristic, that of incomplete infor- 
mation about the type of mating, is perhaps even 
more troublesome but can be overcome. Consider 
the case of a rare recessive character in a population.
With random mating the frequencies of the genotypes
AA, Aa and aa will be (1 − p)² : 2p(1 − p) : p²,
where p is the proportion of gametes carrying the
recessive allelomorph. When p is small p² is very
small (e.g. p = 0.01 gives AA 0.9801 : Aa 0.0198 : aa
0.0001). Then matings involving aa individuals will
be extremely rare. Segregating matings in which
one parent is aa will be even rarer. Hence we may
assume, in order to remove this uncertainty, that all
families showing segregation for the recessive character
will be from matings of two heterozygotes,
Aa × Aa. The error involved in this assumption
will be small when p is small, but if p is large will
become important. Fortunately most of the hereditary
human characters are rare conditions.
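
The size of the error in this assumption can be sketched numerically. The short Python example below is only an illustration: the function names are hypothetical, and the matings are weighted by their frequency under random mating alone, ignoring the further chance that a small family shows no recessive child.

```python
def genotype_frequencies(p):
    """Frequencies of AA, Aa and aa under random mating.

    p is the proportion of gametes carrying the recessive allelomorph.
    """
    q = 1.0 - p
    return q * q, 2.0 * p * q, p * p


def het_by_het_share(p):
    """Share of potentially segregating matings that are Aa x Aa.

    Assumption for illustration: only Aa x Aa and Aa x aa matings can
    segregate, and each is weighted by its random-mating frequency.
    """
    _, het, rec = genotype_frequencies(p)
    het_x_het = het * het          # Aa x Aa
    het_x_rec = 2.0 * het * rec    # Aa x aa (either parent may be aa)
    return het_x_het / (het_x_het + het_x_rec)


if __name__ == "__main__":
    for p in (0.01, 0.1, 0.3):
        print(p, genotype_frequencies(p), het_by_het_share(p))
```

Under these assumptions the share works out to 1 − p, so the approximation is good for a rare recessive and deteriorates steadily as p grows.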

The small size of the families has, in addition to its
effect in invalidating χ² as a test of significance,
another troublesome effect. Not all matings of the type
Aa × Aa will give recessives among the progeny.
Some will give all normals. These cannot then be
distinguished from matings of which one parent was
AA. Now such families will be lost to the records.
We must, then, have some procedure based solely on
families with one or more recessive aa children.
Such families may be found by one of two procedures,
or a mixture of both, viz. searching whole communities
or sections of communities for affected
families, or finding recessive individuals and following
up their pedigrees. In the former case of
‘complete ascertainment’ all segregating families, no
matter how many recessives they may contain, will
be included once. In the latter case of ‘ascertainment
through affected individuals’ the chance of
finding and recording the family is clearly proportional
to the number of recessives in it. The investigator
is twice as likely to meet one or other or both of two
recessives as to meet a single individual.

Various methods of handling data of these types 
have been suggested. Some, like the proband 
method, are solely of value in the case of complete 
ascertainment. In other cases they may give mis- 
leading results. Now complete ascertainment is 
difficult and rarely achieved. Consequently such 
methods are of little general value. 

Of all the methods the sib treatment is the most 
generally applicable. It takes into account the 
chance of ascertainment where this is through affected 
individuals, and is also applicable to completely ascer- 
tained data by a simple extension of the argument. 
There is a small loss of efficiency as compared with 
the proband method in this latter case, but much of 
the lost information can be recovered, if desired, by 
a more complicated analysis (see Fisher, 1934). 

25. THE SIB METHOD 

The logic of the sib method is simple. The chance 
of any sib of an affected individual being itself 
affected is independent of the affected nature of the 
first sib. Then by adding up the frequencies of the 
normal and affected individuals among the sibs of 
affected individuals a good estimate of the proportion 
of recessives emerging in segregating families will be
obtained. It is essential to the avoidance of biased
results, when using this method, that any family be
used as often as it is ascertained, because the method 
is based on sampling the sibs of affected members 
of the population and not on sampling the families 
with affected members. If the family is found by 
virtue of one affected sib it is used once. If it is 
ascertained through each of five affected members it 
must be entered in the records five times. This 
method then allows for or even demands that the 
frequency of ascertainment be proportional to the 
number of affected members it contains. Complete 
ascertainment is included in this scheme by con- 
sidering such families as though ascertained through 
each one of their affected members. 

The working of the method may be simply demonstrated,
using families of three children. Such
families, produced by the cross Aa × Aa, will contain
0, 1, 2 or 3 aa children with the frequencies given
by the expansion of (¾ + ¼)³. These are set out in
the second column of Table 32.


TABLE 32

 Affected sibs in                     Sibship scores
 family of three     Frequency      Normal     Affected

        0              27/64          -           -
        1              27/64        54r/64        0
        2               9/64        18r/64      18r/64
        3               1/64          0          6r/64

      Total              1          72r/64      24r/64


The first type of family, with 0 affected children,
cannot be ascertained as distinct from the progeny of
other crosses, and so is omitted from further consideration.
The next type of family will be ascertained
via its single affected sib. Let us suppose
that the chance of encountering any affected individual
is r. Then such families will be found in r
of cases and, as both the sibs of the affected child are
normal, it will contribute 27/64 × r × 2, i.e. 54r/64, to the
column headed ‘Normal’ in that section of the table
given to the sibship scores. There are no affected
sibs and consequently no contribution to the
‘Affected’ score.

Families containing two affected individuals will
have a chance 2r of entering the records. Each
family contains one normal and one affected sib in
addition to the one through which ascertainment was
made. Hence the contributions to the ‘Normal’ and
‘Affected’ columns are both 9/64 × 2r × 1, i.e. 18r/64.

Similarly families with three affected children will
be found in 3r of cases and will contribute nothing
to the ‘Normal’ column, but 1/64 × 3r × 2, i.e. 6r/64,
to the ‘Affected’ column.

The sums of the ‘Normal’ and ‘Affected’ columns
are then found to be 72r/64 and 24r/64. This is the typical
3 : 1 ratio of a single factor F₂.

The method can be demonstrated in a similar 
manner for the general case of a family of size n and 
a segregation of x : y. 
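
The general case can be sketched in a few lines of Python. This is an illustration only: the function name and arguments are hypothetical, q stands for the expected fraction of recessives, and the chance of a family entering the records is taken, as in the argument above, to be proportional to its number of affected members.

```python
from math import comb


def expected_sib_scores(n, q, r=1.0):
    """Expected 'Normal' and 'Affected' sib scores per family of size n.

    q is the chance that a child is recessive; r is the chance of
    encountering any one affected individual (hypothetical parameter names).
    """
    normal = 0.0
    affected = 0.0
    for a in range(1, n + 1):  # families with no recessive are never found
        frequency = comb(n, a) * q**a * (1.0 - q) ** (n - a)
        times_found = a * r    # ascertained once per affected member
        normal += frequency * times_found * (n - a)
        affected += frequency * times_found * (a - 1)
    return normal, affected


if __name__ == "__main__":
    normal, affected = expected_sib_scores(3, 0.25)
    print(normal, affected)  # 72/64 and 24/64, as in Table 32 with r = 1
    for n in range(2, 9):
        nrm, aff = expected_sib_scores(n, 0.25)
        print(n, aff / (nrm + aff))
```

Whatever the family size, the affected score forms the fraction q of the total score, which is the property the sib method relies on.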

It is clear that with a number of families this
method can give a test of deviation from the expected
ratio. The variation in the summed scores will
depend on variations in family size, frequencies of
possible types in any size of family, and on the frequency
of ascertainment. All these will contribute
to the standard error of the score and must be
employed in the test of significance.

The formula for the variance of y, the proportion
of recessives for any size of family, is

$$V_y = \frac{1}{n'} \cdot \frac{y(1-y)}{s-1}\,(1 + y' + 2yy')$$

where n' is the total number of affected individuals
ascertained, s is the family size and y' a measure of
the completeness of ascertainment. y' is calculated
from the formula

$$y' = \frac{S\{t(t-1)\,n_{at}\}}{S\{t(a-1)\,n_{at}\}}$$

where t is the number of ascertainments, a the number
affected and n_at the number of cases in class at, i.e.
the class with a affected and t ascertained (Fisher, 1934).

Ex. 18. To make this procedure clear we may
take the question of the proportion of albinos in
families segregating for this character. Is this proportion
the 0.25 that would be expected if albinism
is a simple autosomal recessive?

The following forty-seven families of five, six, and 
seven children were found by Pearson, Nettleship and 
Usher (1913) to be segregating various numbers of 
albinos as shown in Table 33. 


TABLE 33

                       Size of Family
 No. of Albinos        5      6      7

       1               7      4      4
       2               6      6      4
       3               4      3      5
       4               1      1      1

     Total            18     14     15


Consider first the families of five children. We
may suppose complete ascertainment in this case,
in the absence of contradictory evidence. Then the
data may be set down in the form of Table 34.

TABLE 34

                                       Sibship scores
 No. of Albinos   No. of Families     Normal   Affected

       1                 7              28         0
       2                 6              36        12
       3                 4              12        12
       4                 1               4        12

     Total              18              80        36

The families with 1 albino, 7 in number, will each
contain 4 normal sibs of the albino and so will contribute
4 × 7 to the ‘Normal’ column, and 0 to the
‘Affected’ column (r is assumed to be 1).

The six families with 2 albinos have, for each
albinotic child, 3 normal and 1 albinotic sibs. The
families must be counted twice as we suppose them
to have been found through each of the 2 albinos.
Their contributions to the normal and affected scores
will be 3 × 6 × 2 and 1 × 6 × 2 respectively.

The entries for the remaining families with 3 or 4
albinos may be calculated in a similar manner. On
summing these columns we find 80 normal and 36
affected sibs. Then

$$y = \frac{36}{80 + 36} = 0.310345$$

Now y', the measure of completeness of ascertainment
of affected children, is 1 as we have assumed
complete ascertainment. The number of ascertainments,
n', is here the total of affected children, i.e. 35,
and s − 1 is 4. As we are testing agreement with
the hypothesis of y = 0.25 we must use this value
of y in calculating the variance (cf. the standard error
of 3 : 1 ratios in Chapter II).

Then

$$V_y = \frac{1}{35} \cdot \frac{\tfrac{1}{4} \cdot \tfrac{3}{4}}{4}\,\left(1 + 1 + \tfrac{1}{2}\right) = 0.00334821$$

and

$$\sigma_y = \sqrt{V_y} = 0.05787$$

Thus the deviation of the observed y from its
expected 0.25 is 0.310345 − 0.25 or 0.060345 ±
0.057873. Such a deviation is not significant. The
families of 5 agree with the single factor hypothesis.

The values of y and V_y for families of 6 and 7 are
arrived at in the same way. They are set out in
Table 35.

TABLE 35

 Family Size       y           V_y          σ_y        I_y       χ²

     5          0.310345    0.00334821    0.057873     298.7    1.088
     6          0.289655    0.00323276    0.056857     309.3    0.358
     7          0.274725    0.00252016    0.050201     396.8    0.243

    Mean        0.289910    0.0009952     0.03155     1004.8    1.165
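
As a check on the procedure, the scoring and variance steps can be sketched in Python for the families of six in Table 33. This is a minimal illustration assuming complete ascertainment (y' = 1, each family counted once per affected member); the function names are hypothetical and not taken from the text.

```python
from math import sqrt


def sib_scores(families, s):
    """Sibship scores under complete ascertainment.

    families maps the number of affected children a to the number of
    families of size s with that many affected.
    """
    normal = sum(count * a * (s - a) for a, count in families.items())
    affected = sum(count * a * (a - 1) for a, count in families.items())
    return normal, affected


def variance_of_y(y_expected, n_ascertained, s, y_prime):
    """V_y = (1/n') * y(1 - y)/(s - 1) * (1 + y' + 2*y*y')."""
    return (
        (1.0 / n_ascertained)
        * y_expected * (1.0 - y_expected) / (s - 1)
        * (1.0 + y_prime + 2.0 * y_expected * y_prime)
    )


sixes = {1: 4, 2: 6, 3: 3, 4: 1}            # families of six, Table 33
normal, affected = sib_scores(sixes, s=6)
y = affected / (normal + affected)
n_prime = sum(a * count for a, count in sixes.items())
v_y = variance_of_y(0.25, n_prime, s=6, y_prime=1.0)
sigma_y = sqrt(v_y)

if __name__ == "__main__":
    print(normal, affected, y, v_y, sigma_y, 1.0 / v_y)
```

Under these assumptions the sketch gives the y, V_y, σ_y and I_y entries shown for family size 6 in Table 35.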


We have now three independent estimates of y,
each with its own variance. A compound estimate
of y may be obtained by finding the weighted mean
of the three separate estimates, the weights being
the amounts of information (i.e. reciprocals of
variances) concerning the various estimates.

Then

$$\bar{y} = \frac{S(y\,I_y)}{S(I_y)} \quad\text{where}\quad I_y = \frac{1}{V_y}, \qquad V_{\bar{y}} = \frac{1}{S(I_y)}$$

Using these formulae we find from Table 35
ȳ = 0.28991, V_ȳ = 0.0009952, σ_ȳ = 0.03155.
Then the deviation of ȳ from the expected 0.25 is
0.03991 ± 0.03155

This is not significant and so the data all agree with
the Mendelian expectation.
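
The weighting can be sketched as follows. The function name is hypothetical; the three estimates and variances are simply those tabulated in Table 35, taken as given.

```python
from math import sqrt


def pooled_estimate(estimates, variances):
    """Weighted mean with weights equal to the information, 1/variance.

    Returns the pooled estimate and its variance, 1/S(I).
    """
    information = [1.0 / v for v in variances]
    total_information = sum(information)
    mean = sum(i * y for i, y in zip(information, estimates)) / total_information
    return mean, 1.0 / total_information


ys = [0.310345, 0.289655, 0.274725]          # y for families of 5, 6, 7
vs = [0.00334821, 0.00323276, 0.00252016]    # corresponding V_y
y_bar, v_bar = pooled_estimate(ys, vs)

if __name__ == "__main__":
    print(y_bar, v_bar, sqrt(v_bar))
    print("deviation from 0.25:", y_bar - 0.25)
```

Because the weights are reciprocals of variances, the most precisely determined family size (here the families of seven) pulls the pooled value most strongly towards its own estimate.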

As we have at hand these three independently
estimated values of y we may perform a simple test
of heterogeneity. It must clearly be based on the
differences between the estimates of y afforded by
families of different sizes.
It will be remembered that, in Chapter II, we
noted that the χ² testing the deviation of a ratio
from its expectation is given by

$$\chi^2 = \frac{(y - v)^2}{V_y}$$

where v is the expected value of y.

Applying this to the data from families of 5 individuals
we find

$$\chi^2 = \frac{(0.31035 - 0.25)^2}{0.0033482} = 1.088$$

This is entered in the last column of Table 35. The
χ² for families of 6 and 7 are found similarly. We
also find and enter the χ² from the weighted mean
value of y. This last measures the joint deviation
from the hypothesis.

Then by adding the three χ²s from the three family
sizes and subtracting the χ² of the joint estimate we
obtain an analysis similar to that for ordinary
genetical segregations. This analysis is:

TABLE 36

                     χ²      D.f.         P

 Deviation         1.165      1      0.30 - 0.20
 Heterogeneity     0.524      2      0.80 - 0.70

 Total             1.689      3

As no value of χ² is significant it can be said that
the data on albinotic children agree both with one
another and with the Mendelian expectation of ¼
affected in segregating families.

The similarity of this method with those of Chapter
II is obvious. Both tests involve the finding of a
quantity and its variance. This is the material of
the test of significance of the departure from expectation
either by the use of the standard error or χ².
The difference of the two treatments, non-human and
human, lies in the necessity for finding a new suitable
quantity y, of known expectation, in the
latter case.

It may be mentioned that the sibship method is
also valuable with other types of data. One other
use, to which it has already been put, is the estimation
of ovule sterility in Pisum. The pods correspond
to the families in the above example. Fertile and
sterile ovules may be recorded at harvest with the
important limitation that pods having no fertile
ovules are lost as they fail to develop and drop from
the plant. The problem is a replica of that worked
out above and has been successfully treated by the
sib method.



CHAPTER X 
HUMAN GENETICS (II) 

26. LINKAGE 

THE question of linkage detection and estimation
from human pedigrees is complicated by the 
same two difficulties, incomplete knowledge of the 
mating and small families, as is the consideration of 
single factor segregation. 

Where there is knowledge of three or more genera- 
tions in the pedigree it is often possible to decide on 
the nature of the cross, and by lumping families from 
matings of the same type, to deal with the data by 
the methods adapted to the more usual types of 
genetical material. 

Ex. 19. For example, Haldane (1936) gives certain
families segregating for retinitis pigmentosa, an eye
defect. This anomaly is due to one of the so-called
dominant genes, i.e. the heterozygote is the type
usually distinguished from the normal. The problem
at issue was that of whether the gene for retinitis
pigmentosa was incompletely linked to sex or not.

The informative families are those from the mating
of affected men and normal women. The male
parent is then heterozygous for both sex and the eye
defect. It is, however, also necessary to know the
phase of the linkage, whether coupling or repulsion.
With a dominant gene such information is usually
easy to obtain. These retinitis pigmentosa pedigrees
give the required information in telling whether the
man in question received the defect from his father
or his mother. If from his father it will have been
transmitted to him with the Y chromosome, and if
from his mother with the X chromosome.

The results quoted by Haldane (l.c.) are from matings
of normal women and retinitis pigmentosa men
and may be summarized as Table 37.


TABLE 37

                  Affected             Normal
               Males   Females     Males   Females    Total

 Coupling        50       30         27       26       133
 Repulsion       30       31         57       37       155


The coupling data were from thirty-two families and
the repulsion data from thirty-three families. The
results are suggestive of sex linkage, but the single
factor ratios are somewhat disturbed. As they are
double backcross families (XYRr × XXrr) we may
test for linkage by means of a 2 × 2 contingency
table, as in Table 38.

TABLE 38

                 a₁       a₂      Total

 Coupling        57       76       133
 Repulsion       88       67       155

 Total          145      143       288

χ² = 30.167 for one degree of freedom

There can be no question of the significance of the
evidence for linkage.
Proceeding to the estimation of p we find

$$\frac{p^2}{(1-p)^2} = \frac{57 \times 67}{88 \times 76}, \quad\text{i.e.}\quad p = 0.4304$$

and

$$s_p = \frac{p(1-p)}{\sqrt{h}} = 0.0293$$

where h is the harmonic mean of the four class frequencies.


Haldane further considers the possibility of there 
being other and autosomal genes producing this eye 
defect, but we need not deal with this question in 
detail. It is sufficient to note that such genes will 
not affect the test of significance but may affect the 
estimate of p. 
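
The estimate of p and its standard error for the retinitis pigmentosa families can be sketched in Python. This is an illustration under stated assumptions: the function name is hypothetical, the four counts are those of Table 38, and the standard error is taken as p(1 − p)/√h with h the harmonic mean of the four classes.

```python
from math import sqrt


def product_ratio_estimate(a1, a2, a3, a4):
    """Estimate p from p^2/(1-p)^2 = a1*a4/(a2*a3), with its standard error.

    Assumption: s_p = p(1-p)/sqrt(h), h being the harmonic mean of the
    four class frequencies.
    """
    odds = sqrt((a1 * a4) / (a2 * a3))       # p/(1 - p)
    p = odds / (1.0 + odds)
    h = 4.0 / (1.0 / a1 + 1.0 / a2 + 1.0 / a3 + 1.0 / a4)
    return p, p * (1.0 - p) / sqrt(h)


# Table 38: coupling 57 and 76, repulsion 88 and 67
p_hat, s_p = product_ratio_estimate(57, 76, 88, 67)

if __name__ == "__main__":
    print(round(p_hat, 4), round(s_p, 4))
```

Taking the geometric mean of the coupling and repulsion ratios in this way lets any disturbance of the single factor ratios cancel between the two phases.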


27. THE U STATISTICS 

In addition to these families for retinitis pigmentosa 
Haldane also quotes families segregating for reces- 
sive characters which it is desired to test for incom- 
plete sex linkage. In these cases there is seldom any 
clue as to whether the heterozygous male parent 
received the gene for the defect from his father or 
his mother, i.e. whether he is doubly heterozygous in 
coupling or repulsion. Further, the small size of the 
human family seldom allows of the question being 
decided from the progeny of such ambiguous males. 
It is thus impossible to apply the same technique as 
to retinitis pigmentosa. Various other methods have 
been suggested for the solution of this problem, but 
the most generally efficient method is that of Fisher
(1935a, 1935b, 1936c). Certain quantities, denoted
by u, are calculated and used as the basis of the
decision. The precise formula for u varies with
the type of family. For the double backcross
(AaBb × aabb) we take

$$u_{11} = (a_1 - a_2 - a_3 + a_4)^2 - (a_1 + a_2 + a_3 + a_4)$$

for the single backcross (AaBb × Aabb)

$$u_{31} = (a_1 - 3a_2 - a_3 + 3a_4)^2 - (a_1 + 9a_2 + a_3 + 9a_4)$$

and for the F₂ (AaBb × AaBb)

$$u_{33} = (a_1 - 3a_2 - 3a_3 + 9a_4)^2 - (a_1 + 9a_2 + 9a_3 + 81a_4)$$

The affinities of these formulae with those for the
linkage χ² of similar families are several. In fact
u₁₁ = n(χ² − 1). The chief point to note about
these u statistics is that, like χ², they test equally
well deviations from expectation whether in the
direction indicating coupling or in the reverse way.
In fact, it can be shown that u is a measure of
p(1 − p). Since coupling families showing p recombination
may be written as repulsion families showing
1 − p recombination, it is then clear that the value of u
is independent of the phase of the linkage. We may
take Haldane’s data on the segregation of the recessive
achromatopsia to illustrate the use of the u
statistics.
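
The three u statistics can be written down directly. In the sketch below the function names follow the subscripts of the formulae; the class counts (10, 2, 3, 9) are invented purely for illustration, and the ordering of the four classes in the last check (normal males, affected males, normal females, affected females) is an assumption that reproduces the value 112 tabulated for the first achromatopsia family.

```python
def u11(a1, a2, a3, a4):
    """Double backcross (AaBb x aabb)."""
    return (a1 - a2 - a3 + a4) ** 2 - (a1 + a2 + a3 + a4)


def u31(a1, a2, a3, a4):
    """Single backcross (AaBb x Aabb)."""
    return (a1 - 3 * a2 - a3 + 3 * a4) ** 2 - (a1 + 9 * a2 + a3 + 9 * a4)


def u33(a1, a2, a3, a4):
    """F2 (AaBb x AaBb)."""
    return (a1 - 3 * a2 - 3 * a3 + 9 * a4) ** 2 - (a1 + 9 * a2 + 9 * a3 + 81 * a4)


if __name__ == "__main__":
    counts = (10, 2, 3, 9)                    # hypothetical backcross family
    n = sum(counts)
    chi_square = (10 - 2 - 3 + 9) ** 2 / n    # linkage chi-square, backcross
    print(u11(*counts), n * (chi_square - 1))  # the two agree
    # Exchanging parental and recombinant classes (coupling <-> repulsion)
    # leaves u unchanged:
    print(u11(2, 10, 9, 3))
    print(u31(1, 3, 4, 0))
```

The squared term is what makes u indifferent to phase: reversing coupling and repulsion changes only the sign of the linear function inside the square.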

Ex. 20. The twenty-eight families given by Haldane
are set out in Table 39. The first four columns
give the number of normal and affected males and
females in the family and the fifth and sixth give the
family size and the number of affected individuals.
The seventh column gives the value of

$$u_{31} = (a_1 - 3a_2 - a_3 + 3a_4)^2 - (a_1 + 9a_2 + a_3 + 9a_4)$$

for each family. We note that u₃₁ is the correct
statistic to use as the cross is of normal male by
normal female, i.e. XYAa × XXAa, which is a
single backcross. The 28 values of u₃₁ are summed
and the sum is a measure of 1 − 4x, where
x is an estimate of p(1 − p).

In order to obtain the actual value of 1 − 4x it is
necessary to divide S(u₃₁) by a divisor, S(k), depending
on the method of ascertainment. The quantity
k is calculated for each family and its sum is the
divisor. In the case of complete ascertainment, we
take

$$k_c = s(s-1)\,\frac{4^s - 3^{s-2}}{4^s - 3^s}$$

which is tabulated by Fisher (1935a).

TABLE 39

Achromatopsia (Haldane’s data)

       Males             Females
  Normal  Affected   Normal  Affected     s    s₂    u₃₁      k_c      k_s      k_i

     1       3          4       0         8     3    112    61.538     84      86.2
     1       0          2       2         5     2      4    25.531     36      30.6
     2       1          3       1         7     2    -22    47.751     66      40.2
     2       2          3       0         7     2     26    47.751     66      40.2
     1       0          1       2         4     2     16    16.937     24      26.2
     2       2          1       0         5     2      4    25.531     36      30.6
     0       1          1       1         3     2    -18     9.892     14      22.0
     3       2          1       1         7     3    -30    47.751     66      79.3
     2       1          0       0         3     1    -10     9.892     14       4.2
     0       6          2       0         8     6    344    61.538     84     294.2
     2       2          1       4         9     6     -8    77.196    104     306.6
     1       2          0       1         4     3    -24    16.937     24      60.0
     1       0          3       1         5     1    -12    25.531     36       9.3
     1       3          0       0         4     3     36    16.937     24      60.0
     0       0          0       2         2     2     18     4.286      6      18.0
     2       3          1       0         6     3     34    35.574     50      72.6
     0       1          0       1         2     2    -18     4.286      6      18.0
     2       1          0       4         7     5     74    47.751     66     200.2
     0       0          0       2         2     2     18     4.286      6      18.0
     3       0          2       1         6     1      2    35.574     50      12.2
     0       1          2       0         3     1     14     9.892     14       4.2
     0       2          1       0         3     2     30     9.892     14      22.0
     0       0          2       1         3     1    -10     9.892     14       4.2
     2       2          1       1         6     3    -26    35.574     50      72.6
     0       0          0       2         2     2     18     4.286      6      18.0
     1       2          0       0         3     2      6     9.892     14      22.0
     0       2          4       1         7     3     18    47.751     66      79.3
     5       1          3       0         9     1    -16    77.196    104      22.2

 Total
    34      40         38      28       140    68    580   826.845   1,144   1,673.8


For single ascertainment through affected individuals

$$k_s = (s-1)(s+4)$$

and is tabulated by Fisher (1935b).

It will be seen that S(k_c) is 826.845 and S(k_s) is
1,144. Therefore if we assume complete ascertainment

$$1 - 4x_c = \frac{S(u_{31})}{S(k_c)} = \frac{580}{826.845} = 0.7015$$

Now if segregation of achromatopsia is independent
of sex, i.e. p = 0.5, 1 − 4x should be 0. Hence the
deviation from expectation is 0.7015. The variance
of 1 − 4x is given by

$$V_{1-4x} = \frac{18}{S(k_c)} = 0.02177$$

and

$$\sigma_{1-4x} = \sqrt{V_{1-4x}} = 0.1476$$

Hence the deviation is 4.753 times its standard error
and must be considered to be highly significant (see
Table I).

Assuming single ascertainment we find in a similar
manner:

$$1 - 4x_s = \frac{S(u_{31})}{S(k_s)} = \frac{580}{1144} = 0.5070, \qquad \sigma_{1-4x} = 0.1254$$

Again the deviation (0.5070) is 4.043 times its standard
error and is highly significant.

It will be noticed that the complete ascertainment 
formulae give an apparently more highly significant 
result. It can be shown, however, that on taking an 
empirical test of significance, by basing the vari- 
ance on the observed distribution of the families 
and not taking their theoretical variance, the two 
methods give very nearly the same significance for 
the deviation. This test is fully described by Fisher 
(1936c), and need not be discussed in full here. 
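
The two calculations can be sketched in Python. The function names are hypothetical; the totals S(u₃₁) = 580, S(k_c) = 826.845 and S(k_s) = 1,144 are taken as given from Table 39, and the variance of 1 − 4x is taken as 18/S(k), the form used above for u₃₁.

```python
from math import sqrt


def k_complete(s):
    """Divisor for complete ascertainment: s(s-1)(4^s - 3^(s-2))/(4^s - 3^s)."""
    return s * (s - 1) * (4**s - 3 ** (s - 2)) / (4**s - 3**s)


def k_single(s):
    """Divisor for single ascertainment: (s-1)(s+4)."""
    return (s - 1) * (s + 4)


def linkage_measure(sum_u, sum_k):
    """Return 1 - 4x and its standard error, sqrt(18 / S(k)), for u31."""
    return sum_u / sum_k, sqrt(18.0 / sum_k)


complete = linkage_measure(580, 826.845)   # totals from Table 39
single = linkage_measure(580, 1144)

if __name__ == "__main__":
    print(k_complete(8), k_single(8))      # per-family divisors for s = 8
    for label, (estimate, se) in (("complete", complete), ("single", single)):
        print(label, estimate, se, estimate / se)
```

Since the expectation of 1 − 4x is zero in the absence of linkage, the ratio of the estimate to its standard error serves directly as the test of significance.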

We may next turn to a consideration of the
estimate of p itself. Now

$$1 - 4x = 1 - 4p(1-p) = 1 - 4p + 4p^2 = (1 - 2p)^2$$

Then

$$1 - 2p_c = \sqrt{1 - 4x_c} = \sqrt{0.7015} = 0.8376$$

or p_c = 8.12 per cent.
Similarly 1 − 2p_s = 0.7131,
or p_s = 14.35 per cent.

Although the two tests of significance based on
1 − 4x_c and 1 − 4x_s gave similar results, when these
quantities are used as the bases for estimating p they
give very different answers. This is due to the dif- 
ferent assumptions, made about the method of ascer- 
tainment, giving very different expectations for the 
number of recessives in the families. Hence in the 
absence of precise knowledge as to the actual method 
of ascertainment employed we may do one of two 
things, (a) take that method of ascertainment whose 
expectation of recessives agrees best with the observed 
results or ( b ) employ a method, if one can be found, 
independent of the method of ascertainment. 

In the present case the first course is of little value 
as there is an excess of affected individuals even over 
the expectation of single ascertainment. Thus the 
second approach, that of finding a method inde- 
pendent of ascertainment, is to be preferred. 

The method of doing this has been worked out by
Fisher (1936c). The equation of estimation is still

$$1 - 4x_i = \frac{S(u_{31})}{S(k_i)}$$

but now

$$k_i = \tfrac{1}{9}\left[(s_1 + 9s_2)^2 - (s_1 + 81s_2)\right]$$

where s₁ is the number of normals in the family and
s₂ the number of affected individuals in the family.
The values of k_i for various s₁ and s₂ are tabulated
by Fisher (1936c) and also in Table IV at the
end of this book. The values of k_i for the present
achromatopsia families are given in Table 39,
column ten.
We then find:

$$1 - 4x_i = \frac{580}{1673.8} = 0.3465$$

$$1 - 2p_i = 0.5886$$

$$p_i = 20.57 \text{ per cent.}$$

This is looser linkage than that shown by either of 
the other methods, as might be expected in view of 
the excess of affected individuals. The important 
point is that it is trustworthy inasmuch as it is independent
of the ascertainment. In general, unless the
method of ascertainment is known with exactitude,
the use of k_i is preferable to the use of k_c and k_s.
If the number of recessives agrees with complete or
single ascertainment then k_i will give an answer
closely approximating the value obtained by the use
of k_c or k_s. This is well demonstrated by various
examples worked by Fisher (l.c.). It must be emphasized
that this k_i is applicable only to the use of u₃₁.
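
The ascertainment-free divisor can be sketched as follows. The function name is hypothetical; the list gives (normals, affected) for the twenty-eight achromatopsia families in the order of Table 39, S(u₃₁) = 580 is taken from that table, and the factor 1/9 in k_i is the one that reproduces the tabulated values (e.g. 22.0 for one normal and two affected).

```python
from math import sqrt


def k_independent(s1, s2):
    """k_i = [(s1 + 9*s2)^2 - (s1 + 81*s2)] / 9 for s1 normals, s2 affected."""
    return ((s1 + 9 * s2) ** 2 - (s1 + 81 * s2)) / 9.0


# (normals, affected) per family, in the order of Table 39
families = [
    (5, 3), (3, 2), (5, 2), (5, 2), (2, 2), (3, 2), (1, 2),
    (4, 3), (2, 1), (2, 6), (3, 6), (1, 3), (4, 1), (1, 3),
    (0, 2), (3, 3), (0, 2), (2, 5), (0, 2), (5, 1), (2, 1),
    (1, 2), (2, 1), (3, 3), (0, 2), (1, 2), (4, 3), (8, 1),
]

sum_k_i = sum(k_independent(s1, s2) for s1, s2 in families)
one_minus_4x = 580 / sum_k_i                 # S(u31) from Table 39
p_i = (1.0 - sqrt(one_minus_4x)) / 2.0       # since 1 - 4x = (1 - 2p)^2

if __name__ == "__main__":
    print(sum_k_i, one_minus_4x, p_i)
```

Because k_i is built from the observed numbers of normals and affected in each family, no assumption about how the family came to be recorded enters the divisor.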

Returning to the general properties of u statistics 
it should be noted that whereas these statistics are 
fully efficient for the detection of linkage they are 
more or less inefficient for its estimation (Fisher, 
1935a). The loss of efficiency is, however, small for 
recombination values above 10 per cent, and only 
becomes considerable for values below 5 per cent.

The formulae for the calculation of 1 − 4x from
single backcross data were used in the above example.
The corresponding formulae for the F 2 and double 
backcross are : 

Double Backcross 


1 4z x — 


8(u h ) 

m 



where 

F 2 


h = s(s — 1) 


1 _ 4% = 


h = 2(s - 


S{Vu) XT ___ 

m 

4 s — 3 S ~ 2 
l)(s + 4) 4s _ 3* ~ 


where


28. LINKAGE DETECTION WHEN THE PARENTS ARE
UNKNOWN

So far we have considered the detection of linkage
when both parents are known phenotypically, whatever
their genotype. If but one parent is known a
modified u method may be used for the analysis of
linkage (Fisher, 1935b). But we can also detect
linkage purely from a study of sibs, without any
knowledge of their parents, as Penrose (1935) has
shown.

Let us consider pairs of sibs from families showing
segregation for two autosomal characters A,a and
B,b. For character A,a the two sibs may be alike
(A and A or a and a) or different (a and A). Similarly
they may be alike or unlike for character B,b.

Taking the two characters together the pairs of 
sibs fall into four classes, as being like or unlike for 
A, a and like or unlike for B,b. The four classes 
will comprise : 




TABLE 40

                     A,a like                    A,a unlike

 B,b like       1. (A A or a a) and          2. (A a or a A) and
                   (B B or b b)                 (B B or b b)

 B,b unlike     3. (A A or a a) and          4. (A a or a A) and
                   (B b or b B)                 (B b or b B)


The frequencies of these four classes should be in
simple proportion if there is no linkage, but classes
1 and 4 will be increased if linkage is in fact present.
This may be tested by the calculation of χ² for a
2 × 2 contingency table as used in a number of
previous examples.

If a family consists of more than two children it 
may be used as many times as pairs can be formed. 
For example, three children may be divided into and 
used as three pairs, four children into six pairs, &c. 

Ex. 21. Penrose (l.c.) reports fifty pairs of sibs
classified for blood group B as opposed to group O
and for blue eyes as opposed to not blue eyes. His
results give a 2 × 2 table, of the form discussed
above, thus:

TABLE 41

                           B
                     Like    Unlike

 Blue   Like          31        2        33
        Unlike        14        3        17

                      45        5        50

There is a suggestion of the classes 1 and 4 being in
excess. Is this evidence for linkage significant?

It will be noticed that the expectation in two of
the classes in this table is below 5 and so we cannot
calculate χ² without correction as it would overemphasize
discrepancies. This overemphasis results
from the assumption of continuity in using the χ²
distribution, whereas actually the data are discontinuous.
This error can, however, be materially
reduced by using Yates’ (1931) Correction for Continuity
which consists of reducing the two high
classes each by 0.5 and similarly increasing the two
low classes. On doing this the table becomes:

TABLE 42

     30.5       2.5       33
     14.5       2.5       17

     45         5         50

χ² may now be calculated by the usual formula and
gives

$$\chi^2 = \frac{(30.5 \times 2.5 - 14.5 \times 2.5)^2 \times 50}{45 \times 5 \times 17 \times 33} = 0.63$$

Such a χ² for one degree of freedom has a probability
of between 0.5 and 0.3. The suggestion of
linkage is not borne out by statistical analysis.
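
The correction and the resulting χ² can be sketched in Python. The function names are hypothetical; the four counts are those of Table 41, and a class is treated as high or low according to whether it exceeds or falls short of its expectation from the marginal totals.

```python
def yates_corrected(a, b, c, d):
    """Move each class 0.5 towards its expectation in a 2 x 2 table.

    The table is laid out as rows (a, b) and (c, d).
    """
    n = a + b + c + d
    row_totals = (a + b, a + b, c + d, c + d)
    column_totals = (a + c, b + d, a + c, b + d)
    corrected = []
    for observed, row, column in zip((a, b, c, d), row_totals, column_totals):
        expected = row * column / n
        if observed > expected:
            corrected.append(observed - 0.5)
        elif observed < expected:
            corrected.append(observed + 0.5)
        else:
            corrected.append(float(observed))
    return corrected


def chi_square_2x2(a, b, c, d):
    """Usual 2 x 2 formula: (ad - bc)^2 n / product of the four margins."""
    n = a + b + c + d
    return (a * d - b * c) ** 2 * n / ((a + b) * (c + d) * (a + c) * (b + d))


adjusted = yates_corrected(31, 2, 14, 3)     # Table 41
chi_square = chi_square_2x2(*adjusted)

if __name__ == "__main__":
    print(adjusted, chi_square)
```

The marginal totals are unchanged by the correction, so the denominator of the formula is the same as for the uncorrected table.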

Penrose notes that this method would, even in
good circumstances, probably require nearly 100
pairs to give a significant result. Hence it should
not be used unless the parents cannot be obtained
or else the collection of pairs of sibs is so much easier
than the collection of whole families, that vastly
increased numbers of observations can be made.

It will be seen from this and the preceding chapter
that the methods applicable to human data are
related to those simpler methods in use for other
genetical data. They are more complex and often
less efficient than the other methods because of the
shortcomings of human data itself. These methods
formulated for human data may also prove valuable
in the analysis of data from other species in which
the various complicating circumstances are encountered.



CHAPTER XI 
SYMBOLS AND FORMULAE 


Symbols

a       number observed in a class.
A₁      sum of a₁ and a₄ in a four-class segregation.
A₂      sum of a₂ and a₃ in a four-class segregation.
D_x     deviation from zero of maximum likelihood
        expression of x.
f       misclassification due to incomplete manifestation
        of a character.
i_x     = I_x/n = 1/(nV_x), amount of information concerning x
        per individual in a family.
k       coefficient in orthogonal functions.
l       the characteristic proportion in a two-class
        segregation which may be represented as l : 1.
        1 − x.
m       proportion expected in any class.
n       number of individuals in a family.
p       recombination fraction.
P       = p² or (1 − p)² in F₂ data.
s_x     standard error of x (σ_x = s_x when x is fixed by
        hypothesis).
S       summation over all classes.
V_x     variance of x = (s_x)² = 1/(n i_x).
x       (i) 1 − y = chance of any individual being of
        a chosen type in a two-class segregation.
        (ii) Also used as p(1 − p) in human data.
        (iii) Also used in the analysis of χ² by orthogonal
        functions.
y       proportion of recessives as found by the sib
        method.

Binomial Expansion

$$(x + y)^n = x^n + n x^{n-1} y + {}^{n}C_2\, x^{n-2} y^2 \ldots + n x y^{n-1} + y^n$$


for a two-class segregation expected to be l : 1

$$\chi^2 = \frac{(a_1 - l a_2)^2}{l n}$$


Brandt and Snedecor formula for testing hetero- 
geneity 


7k* 




's(—\ - — 

l n J n t 


From 2 × 2 contingency table

$$\chi^2 = \frac{(a_1 a_4 - a_2 a_3)^2\, n}{(a_1 + a_2)(a_3 + a_4)(a_1 + a_3)(a_2 + a_4)}$$

For the detection of linkage between two factors
segregating into l₁ : 1 and l₂ : 1 respectively

$$\chi^2 = \frac{(a_1 - l_2 a_2 - l_1 a_3 + l_1 l_2 a_4)^2}{l_1 l_2 n}$$

Estimation

Likelihood expression is

$$\frac{n!}{a_1!\,a_2! \ldots}\; m_1^{a_1} m_2^{a_2} \ldots$$

Logarithm likelihood

$$L = C + a_1 \log m_1 + a_2 \log m_2 \ldots$$

Equation of estimation by maximum likelihood



to test heterogeneity between bodies of data. 
Product equation of estimation

$$\frac{a_1 a_4}{a_2 a_3} = \frac{m_1 m_4}{m_2 m_3} \quad \text{[fully efficient for linkage estimation]}$$


Human Data 

1 y( 1 - y) 
Vy ~ n 1 5-1 


v y = -7 ~ — f (i + y' + 2 yy') where 

n 6 — i 

, m - 1)«4 
y 8{K 9 - 1 )%*} 

$$u_{11} = (a_1 - a_2 - a_3 + a_4)^2 - (a_1 + a_2 + a_3 + a_4)$$

$$u_{31} = (a_1 - a_2 - 3a_3 + 3a_4)^2 - (a_1 + a_2 + 9a_3 + 9a_4)$$

$$u_{33} = (a_1 - 3a_2 - 3a_3 + 9a_4)^2 - (a_1 + 9a_2 + 9a_3 + 81a_4)$$
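The u statistics share one pattern: the square of the linkage deviation, less the sum of the class numbers weighted by the squares of the same coefficients. The sketch below assumes that pattern holds for all three (the subtracted term of the last statistic is inferred from the first two), and the function name is hypothetical.

```python
def u_statistic(a1, a2, a3, a4, l1, l2):
    """u statistic for factors segregating l1 : 1 and l2 : 1.

    l1 = l2 = 1 gives u11; l1 = 3, l2 = 1 gives u31; l1 = l2 = 3 gives u33.
    """
    # Coefficients of the linkage deviation for the four classes.
    coefficients = (1, -l2, -l1, l1 * l2)
    counts = (a1, a2, a3, a4)
    deviation = sum(c * a for c, a in zip(coefficients, counts))
    # Correction term: class numbers weighted by the squared coefficients.
    correction = sum(c * c * a for c, a in zip(coefficients, counts))
    return deviation ** 2 - correction
```

For a family of one individual the statistic is zero whichever class that individual falls in, which is a quick check on the coefficients.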

1 ^- 4 ^ = [h depends on ascertainment, page 113 
S(fc) 

et seq.) 

V (1 ^ x) = ~ for %1 


for u zl 
Special Formulae

Ratio      $\sigma_{nx}$                  $\chi^2$
1 : 1      $\frac{1}{2}\sqrt{n}$          $\frac{(a_1 - a_2)^2}{n}$
3 : 1      $\frac{1}{4}\sqrt{3n}$         $\frac{(a_1 - 3a_2)^2}{3n}$
15 : 1     $\frac{1}{16}\sqrt{15n}$       $\frac{(a_1 - 15a_2)^2}{15n}$
9 : 7      $\frac{1}{16}\sqrt{63n}$       $\frac{(7a_1 - 9a_2)^2}{63n}$


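Linkage between factors segregating in:

1 : 1 and 1 : 1 (Backcross)

$$\chi^2 = \frac{(a_1 - a_2 - a_3 + a_4)^2}{n}$$

$$p = \frac{a_2 + a_3}{n} \text{ in coupling, or } \frac{a_1 + a_4}{n} \text{ in repulsion}$$

$$V_p = \frac{p(1 - p)}{n}$$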

1 : 1 and 3 : 1 (Single backcross)

$$\chi^2 = \frac{(a_1 - a_2 - 3a_3 + 3a_4)^2}{3n}$$

p given by solution of

$$\frac{a_1}{2 - p} - \frac{a_2}{1 + p} - \frac{a_3}{p} + \frac{a_4}{1 - p} = 0$$

$$V_p = \frac{2p(1 - p)(1 + p)(2 - p)}{n(1 + 2p - 2p^2)}$$
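The single backcross equation has no simple closed solution, so p is usually found numerically. The following is a minimal sketch by bisection, assuming the four classes are expected in the proportions (2 - p)/4, (1 + p)/4, p/4 and (1 - p)/4 (the coupling arrangement, which is an assumption here), and that $a_3$ and $a_4$ are both greater than zero. The function names and the counts in the test are illustrative only.

```python
def single_backcross_p(a1, a2, a3, a4, tol=1e-10):
    """Solve the maximum likelihood equation for p by bisection on (0, 1)."""

    def score(p):
        # Derivative of the log likelihood under the assumed class proportions.
        return -a1 / (2 - p) + a2 / (1 + p) + a3 / p - a4 / (1 - p)

    low, high = 1e-12, 1 - 1e-12
    # The score falls steadily as p rises, so a sign change brackets the root.
    while high - low > tol:
        middle = (low + high) / 2
        if score(middle) > 0:
            low = middle
        else:
            high = middle
    return (low + high) / 2


def single_backcross_variance(p, n):
    """Variance of p from the amount of information per individual."""
    return 2 * p * (1 - p) * (1 + p) * (2 - p) / (n * (1 + 2 * p - 2 * p * p))
```

With counts exactly proportional to expectation at p = 0.2 (180, 120, 20, 80) the routine returns 0.2 to within the tolerance, and the standard error is the square root of `single_backcross_variance(p, n)`.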

3 : 1 and 3 : 1 ($F_2$)

$$\chi^2 = \frac{(a_1 - 3a_2 - 3a_3 + 9a_4)^2}{9n}$$

$P = p^2$ or $(1 - p)^2$ is given by the solution of

$$nP^2 - (a_1 - 2a_2 - 2a_3 - a_4)P - 2a_4 = 0$$

$$V_P = \frac{2P(1 - P)(2 + P)}{n(1 + 2P)}$$

and

$$V_p = \frac{(1 - P)(2 + P)}{2n(1 + 2P)}$$
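As a worked illustration, the quadratic in P can be solved directly and the two variances evaluated at the solution. This is a sketch under the assumption that the positive root is the one wanted; the function name is hypothetical.

```python
import math


def f2_maximum_likelihood(a1, a2, a3, a4):
    """Return (P, V_P, V_p) for F2 data by maximum likelihood.

    P stands for p**2 or (1 - p)**2, as in the text.
    """
    n = a1 + a2 + a3 + a4
    b = a1 - 2 * a2 - 2 * a3 - a4
    # Positive root of n*P**2 - b*P - 2*a4 = 0.
    P = (b + math.sqrt(b * b + 8 * n * a4)) / (2 * n)
    variance_P = 2 * P * (1 - P) * (2 + P) / (n * (1 + 2 * P))
    variance_p = (1 - P) * (2 + P) / (2 * n * (1 + 2 * P))
    return P, variance_P, variance_p
```

For counts in the exact 9 : 3 : 3 : 1 ratio the solution is P = 0.25, which corresponds to p = 0.5, that is, to no linkage.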
Product Formulae in $F_2$

$$\frac{a_1 a_4}{a_2 a_3} = \frac{P(2 + P)}{(1 - P)^2}$$

where $P = p^2$ or $(1 - p)^2$

P is the solution of

$$(a_1 a_4 - a_2 a_3)P^2 - 2(a_1 a_4 + a_2 a_3)P + a_1 a_4 = 0$$

$$V_P = \frac{2P(1 - P)(2 + P)}{n(1 + 2P)}$$
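The product quadratic can be solved in the same way. The sketch below takes the root lying between 0 and 1 and treats the case $a_1 a_4 = a_2 a_3$ separately, since the quadratic term then vanishes. The function name is hypothetical, and $a_1 a_4$ is assumed to be greater than zero.

```python
import math


def f2_product_estimate(a1, a2, a3, a4):
    """Solve the product quadratic for P (= p**2 or (1 - p)**2) in F2 data."""
    x = a1 * a4
    y = a2 * a3
    if x == y:
        # The quadratic reduces to -4*x*P + x = 0.
        return 0.25
    # Root of (x - y)*P**2 - 2*(x + y)*P + x = 0 that lies between 0 and 1.
    return ((x + y) - math.sqrt(y * y + 3 * x * y)) / (x - y)
```

Counts of 264, 36, 36 and 64, which are exactly proportional to expectation for P = 0.64, return 0.64.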
(Reprinted by kind permission of Messrs. Oliver and Boyd)
For larger values of n, the expression $\sqrt{2\chi^2} - \sqrt{2n - 1}$ may be used as a normal deviate with unit standard error.
(Reprinted by kind permission of Messrs. Oliver and Boyd)
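The large-n rule quoted above is easily applied by machine. In this sketch n is taken to be the number of degrees of freedom by which the table of $\chi^2$ is entered (an assumption, since the table itself is not reproduced here), and the function name is hypothetical.

```python
import math


def chi2_as_normal_deviate(chi2, n):
    """Approximate normal deviate for a chi-square on n degrees of freedom."""
    return math.sqrt(2 * chi2) - math.sqrt(2 * n - 1)
```

The value returned is referred to a table of the normal distribution in the usual way. For example, `chi2_as_normal_deviate(50, 40)` is the square root of 100 less the square root of 79.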
TABLE III

Fraction                        Level of Probability
expected    0.900   0.950   0.980   0.990   0.995   0.998   0.999

1/2           3.3     4.3     5.6     6.6     7.6     9.0    10.0
1/4           8.1    10.4    13.6    16.0    18.4    21.6    24.0
1/8          17.2    22.4    29.3    34.5    39.7    46.5    51.7
1/16         35.6    46.3    60.5    71.2    81.9    96.0   106.8
1/32         73.0    95.0   124.0   146.0   168.0   197.0   219.0
1/64        147.1   191.3   249.9   296.1   338.4   396.9   441.2
1/3           5.7     7.4     9.7    11.4    13.1    15.3    17.0
1/9          19.5    25.4    33.2    39.1    44.9    52.7    58.6
1/27         61.0    79.3   103.6   122.0   140.3   164.6   182.9

The numbers in the body of the table are the numbers of
individuals which should be raised, in a progeny, in order
that a certain type, expected to form a known fraction of
the progeny, may be expected to occur, with a chosen level
of probability, at least once. For example, suppose on
selfing a plant heterozygous for one gene (i.e. Aa) we want
to raise a family sufficiently large to contain at least one
recessive (aa) in 99 cases out of 100. Recessive types are
expected in 1/4 of the cases. Then taking the second row of
the table (the 1/4 row) and the fourth column (probability
0.99) we find that sixteen plants are needed.
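Entries for fractions or probabilities not tabulated can be found from first principles. If the type forms a fraction f of the progeny, the chance that none appears among N individuals is $(1 - f)^N$, so N follows from setting this equal to one minus the chosen probability. The sketch below assumes this is how the table was computed; the function name is hypothetical.

```python
import math


def family_size(fraction, probability):
    """Number of individuals needed for a type expected in the given
    fraction of the progeny to occur at least once with the given probability."""
    return math.log(1 - probability) / math.log(1 - fraction)
```

The result is not a whole number (for a fraction of 1/4 and probability 0.99 it is very close to 16), and `math.ceil` gives the smallest whole family that meets the chosen level in full.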





TABLE IV (Fisher 1936a)

Table of $\kappa_{31}$ [$= \frac{1}{9}(s_1 + 9s_2)^2 - \frac{1}{9}(s_1 + 81s_2)$] for use with $u_{31}$

$s_1$ = number of normal children
$s_2$ = number of affected children

                                    $s_2$
$s_1$       1       2       3       4       5       6       7       8       9

 0                18.0    54.0   108.0   180.0   270.0   378.0   504.0   648.0
 1        2.0     22.0    60.0   116.0   190.0   282.0   392.0   520.0
 2        4.2     26.2    66.2   124.2   200.2   294.2   406.2   536.2
 3        6.6     30.6    72.6   132.6   210.6   306.6   420.6   552.6
 4        9.3     35.3    79.3   141.3   221.3   319.3   435.3
 5       12.2     40.2    86.2   150.2   232.2   332.2   450.2
 6       15.3     45.3    93.3   159.3   243.3   345.3   465.3
 7       18.6     50.6   100.6   168.6   254.6   358.6
 8       22.2     56.2   108.2   178.2   266.2   372.2
 9       26.0     62.0   116.0   188.0   278.0   386.0
10       30.0     68.0   124.0   198.0   290.0
11       34.2     74.2   132.2   208.2   302.2
12       38.6     80.6   140.6   218.6   314.6
13       43.3     87.3   149.3   229.3
14       48.2     94.2   158.2   240.2
15       53.3    101.3   167.3   251.3
16       58.6    108.6   176.6
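Values outside the range of Table IV can be computed from the expression in its heading. The sketch below assumes that expression to be one ninth of the square of $(s_1 + 9s_2)$, less one ninth of $(s_1 + 81s_2)$, which reproduces the legible entries; the function name is hypothetical.

```python
def kappa(s1, s2):
    """Table IV quantity for s1 normal and s2 affected children."""
    return ((s1 + 9 * s2) ** 2 - (s1 + 81 * s2)) / 9
```

The tabulated entries appear to be cut off, not rounded, at the first decimal place: `kappa(3, 1)` is 6.67 to two places, against 6.6 in the table.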
Contingency table, 88, 118
Continuity, correction for, 119

Continuous distribution, 10 
Coupling, 4, 61, 92, 113 

Degrees of freedom, 16, 33 

— loss of, 22, 54 
Deviation, 69, 77 
Diploid, 2 
Dominance, 60, 110 
Drosophila melanogaster, 2
Duplicate factors, 41 


Efficiency, criterion of, 45 

— of statistics, 51 et seq., 92, 117

Estimation, combined, 69 et seq.

— criteria of, 44 et seq. 

— equation of, 46 

— inefficient, 54 

— method of, 7, 11, 44 
Eye colour, 119 

F2, 3, 36, 45, 57, 79, 112, 117

— disturbed, 87 

Family size, 26 et seq., 101 et seq.

Fisher, 17, 24, 45, 51, 54, 92, 95 et seq., 112 et seq.
Frequency distribution, 8 

Grouse locusts, see Acridium 

Haldane, 52, 110 
Heterogeneity, of data, 17 et seq., 37, 73 et seq.
Heterozygote, 3, 26 
Hierarchical tables, 20 
Homozygote, 26 
Human data, 100 
Hutchinson, 64 

Imai, 36, 48 
Immer, 54, 55 
Information, 53, 72, 77, 107 

— calculation of, 57 et seq. 
Interpolation, solution by, 71 

Jenkins, 78, 88 
Lethal factors, 3
Likelihood, 46, 70 

— logarithm of, 46, 70 
Linear functions, 39 
Linkage, 4, 7, 44 

— with sex, 110 

Maize, 78, 88 

Manifestation, of characters, 84, 91

Matings, human, 101
Mather, 17, 57 et seq., 69 et seq.

Maximization, 46, 70
Maximum Likelihood, method of, 11, 45 et seq., 77, 87, 91

Mean information, 73 

— weighted, 107 
Mice, 17, 22 
Misclassification, 27, 29 
Multinomial distribution, 11

Nabours, 93 
Nettleship, 105 
Normal deviates, 14, 74 

— distribution, 10 

Orthogonal functions, 32 et seq., 38 et seq.

— derivation of, 40 
Ovule sterility, 109 

Parameter, 7, 10, 44 
Pearson, 15, 105 
Penrose, 118 
Pharbitis, 36, 48
Philp, 34, 47 
Pisum, 109 
Precision, 12, 28 
Primula sinensis, 2, 52 
Proband method, 102 


Product formula, 54, 90, 96 

— tables of, 55 
Poppy, 34, 47 

Recessives, human, 112 
Recombination, 4, 5, 44 
Repulsion, 4, 61, 92, 113 
Retinitis pigmentosum, 110 

Sansome, 69 
Segregation, 2, 29, 69 
Separation, index of, 69 
Sib method, 102 et seq. 
Significance, test of, 5, 9 

— meaning of, 6 
Simplex autotetraploid, 69 
Single factors, 2, 13

— in humans, 100
Smith, 84 
Snedecor, 22, 38 
Standard error, 10, 13, 48, 50, 73, 105, 115
Statistic, 10, 11 
Sufficiency, criterion of, 45 
Sweet Pea, 73 

Tetrad analysis, 3, 4 
Tomato, 70 

u statistics, 112 
Usher, 105 

Variance, 12, 39, 48, 50, 72, 92, 96 et seq., 105, 115
Viability, 91, 93 

Weights, see Mean 
Wheat, 84 
Winton, De, 52 

Yates, 119 




